From hzmao at hotmail.com Mon Sep 4 22:42:40 2017 From: hzmao at hotmail.com (hanzi mao) Date: Tue, 5 Sep 2017 02:42:40 +0000 Subject: [scikit-learn] Problem found when testing DecisionTreeClassifier within the source folder Message-ID: Hi, I have been researching the DecisionTree source code recently. Here are the things I tried. 1. Downloaded the source code from GitHub. 2. Ran "python setup.py build_ext --inplace" to compile the sources in the unzipped source folder. 3. Tried the following code to see whether it works. Here I changed the name of the sklearn folder to sklearn1 to differentiate it from the one installed. >>> from sklearn1 import tree >>> from sklearn.datasets import load_iris >>> iris = load_iris() >>> clf = tree.DecisionTreeClassifier() >>> clf = clf.fit(iris.data, iris.target) Traceback (most recent call last): File "", line 1, in File "sklearn1\tree\tree.py", line 790, in fit X_idx_sorted=X_idx_sorted) File "sklearn1\tree\tree.py", line 341, in fit self.presort) TypeError: Argument 'criterion' has incorrect type (expected sklearn.tree._criterion.Criterion, got sklearn.tree._criterion.Gini) Then this weird error happened. I also tried the newest stable version of scikit-learn earlier today; it had the same error. So I thought trying the newest version on GitHub might help. Unluckily, it didn't. I have limited knowledge of the scikit-learn source code, so I am wondering if anyone could help me with this. Thanks! Best, Hanna -------------- next part -------------- An HTML attachment was scrubbed... URL: From joel.nothman at gmail.com Mon Sep 4 23:21:01 2017 From: joel.nothman at gmail.com (Joel Nothman) Date: Tue, 5 Sep 2017 13:21:01 +1000 Subject: [scikit-learn] Problem found when testing DecisionTreeClassifier within the source folder In-Reply-To: References: Message-ID: I suspect this is due to an intricacy of Cython.
Despite using relative imports, Cython expects the Criterion instance to come from a package called sklearn, not called sklearn1. On 5 September 2017 at 12:42, hanzi mao wrote: > Hi, > > I am researching on the source code of DecisionTree recently. Here are the > things I tried. > > > 1. Downloaded source code from github. > 2. run "python setup.py build_ext --inplace" to compile the sources in > the unzipped source folder. > 3. Try the following codes to see whether it works. Here I changed the > name of the sklearn folder to sklearn1 to differentiate it from the one > installed. > > > >>> from sklearn1 import tree > > >>> from sklearn.datasets import load_iris > > >>> iris = load_iris() > > >>> clf = tree.DecisionTreeClassifier() > > >>> clf = clf.fit(iris.data, iris.target) > > Traceback (most recent call last): > > File "", line 1, in > > File "sklearn1\tree\tree.py", line 790, in fit > > X_idx_sorted=X_idx_sorted) > > File "sklearn1\tree\tree.py", line 341, in fit > > self.presort) > > TypeError: Argument 'criterion' has incorrect type (expected > sklearn.tree._criterion.Criterion, got sklearn.tree._criterion.Gini) > > Then a weird error happened. Actually I also tried the newest stable > version of scikit-learn earlier today. It had the same error. So I was > thinking maybe try the newest version in github might help. Unlikely, it > didn't. > > I have limited knowledge about the source code of scikit-learn. I am > wondering if anyone could help me with this. > > Thanks! > > Best, > Hanna > > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... 
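A pure-Python sketch of the failure mode Joel describes (no Cython needed): loading the "same" module under two different package names produces two distinct class objects, so a type check against one fails for instances of the other. Cython's compiled argument check behaves analogously, keyed on the fully-qualified module name. The file and module names below are illustrative.

```python
import importlib.util
import os
import tempfile

# One source file defining a Criterion class, loaded under two module names.
src = "class Criterion:\n    pass\n"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "_criterion.py")
    with open(path, "w") as fh:
        fh.write(src)

    def load_as(name, path=path):
        # Load the same file as if it lived at a given dotted module path.
        spec = importlib.util.spec_from_file_location(name, path)
        module = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(module)
        return module

    mod_sklearn = load_as("sklearn.tree._criterion")
    mod_sklearn1 = load_as("sklearn1.tree._criterion")

obj = mod_sklearn1.Criterion()
# Identical source, but distinct types -- the isinstance check fails:
print(isinstance(obj, mod_sklearn.Criterion))  # False
```

This is why the renamed `sklearn1` tree code, which still imports pieces of the installed `sklearn`, rejects its own Gini criterion.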
URL: From se.raschka at gmail.com Mon Sep 4 23:29:09 2017 From: se.raschka at gmail.com (Sebastian Raschka) Date: Mon, 4 Sep 2017 23:29:09 -0400 Subject: [scikit-learn] Problem found when testing DecisionTreeClassifier within the source folder In-Reply-To: References: Message-ID: Hi, Hanna, I think Joel is right and the renaming is probably causing the issues. Instead of renaming the package to sklearn1, consider modifying, compiling, and installing sklearn in a virtual environment. I am not sure if you are using conda, in this case, creating a new virtual env for development would be really straight forward: conda create -n 'my-sklearn-dev' source activate my-sklearn-dev There are also a bunch of Python packages out there that do essentially the same thing (https://docs.python.org/3/tutorial/venv.html); I am not sure which one people generally recommend/prefer. Anyway, to use venv that should be available in Python already, you could do e.g., python -m venv my-sklearn-dev source my-sklearn-dev/bin/activate Best, Sebastian > On Sep 4, 2017, at 11:21 PM, Joel Nothman wrote: > > I suspect this is due to an intricacy of Cython. Despite using relative imports, Cython expects the Criterion instance to come from a package called sklearn, not called sklearn1. > > On 5 September 2017 at 12:42, hanzi mao wrote: > > Hi, > > I am researching on the source code of DecisionTree recently. Here are the things I tried. > > 1. Downloaded source code from github. > 2. run "python setup.py build_ext --inplace" to compile the sources in the unzipped source folder. > 3. Try the following codes to see whether it works. Here I changed the name of the sklearn folder to sklearn1 to differentiate it from the one installed.
> > > > >>> from sklearn1 import tree > > > >>> from sklearn.datasets import load_iris > > > >>> iris = load_iris() > > > >>> clf = tree.DecisionTreeClassifier() > > > >>> clf = clf.fit(iris.data, iris.target) > > > Traceback (most recent call last): > > > File "", line 1, in > > > File "sklearn1\tree\tree.py", line 790, in fit > > > X_idx_sorted=X_idx_sorted) > > > File "sklearn1\tree\tree.py", line 341, in fit > > > self.presort) > > > TypeError: Argument 'criterion' has incorrect type (expected sklearn.tree._criterion.Criterion, got sklearn.tree._criterion.Gini) > > > Then a weird error happened. Actually I also tried the newest stable version of scikit-learn earlier today. It had the same error. So I was thinking maybe try the newest version in github might help. Unlikely, it didn't. > > I have limited knowledge about the source code of scikit-learn. I am wondering if anyone could help me with this. > > Thanks! > > Best, > Hanna > > > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn From hzmao at hotmail.com Tue Sep 5 00:26:09 2017 From: hzmao at hotmail.com (hanzi mao) Date: Tue, 5 Sep 2017 04:26:09 +0000 Subject: [scikit-learn] Problem found when testing DecisionTreeClassifier within the source folder In-Reply-To: References: , Message-ID: Thanks Sebastian and Joel! It works after I run the codes in the newly created virtual environment. I thought after the sources were compiled, it was ok to change the folder name. Also thanks for teaching me to use the virtual environment! 
Best, Hanna ________________________________ From: scikit-learn on behalf of Sebastian Raschka Sent: Monday, September 4, 2017 11:29 PM To: Scikit-learn mailing list Subject: Re: [scikit-learn] Problem found when testing DecisionTreeClassifier within the source folder Hi, Hanna, I think Joel is right and the renaming is probably causing the issues. Instead of renaming the package to sklearn1, consider modifying, compiling, and installing sklearn in a virtual environment. I am not sure if you are using conda, in this case, creating a new virtual env for development would be really straight forward: conda create -n 'my-sklearn-dev' source activate my-sklearn-dev There are also a bunch of Python packages out there that do essentially the same thing (https://docs.python.org/3/tutorial/venv.html); I am not sure which one people generally recommend/prefer. Anyway, to use venv that should be available in Python already, you could do e.g., python -m venv my-sklearn-dev source my-sklearn-dev/bin/activate Best, Sebastian > On Sep 4, 2017, at 11:21 PM, Joel Nothman wrote: > > I suspect this is due to an intricacy of Cython. Despite using relative imports, Cython expects the Criterion instance to come from a package called sklearn, not called sklearn1. > > On 5 September 2017 at 12:42, hanzi mao wrote: > > Hi, > > I am researching on the source code of DecisionTree recently. Here are the things I tried. > > 1. Downloaded source code from github. > 2. run "python setup.py build_ext --inplace" to compile the sources in the unzipped source folder. > 3. Try the following codes to see whether it works. Here I changed the name of the sklearn folder to sklearn1 to differentiate it from the one installed.
> > > > >>> from sklearn1 import tree > > > >>> from sklearn.datasets import load_iris > > > >>> iris = load_iris() > > > >>> clf = tree.DecisionTreeClassifier() > > > >>> clf = clf.fit(iris.data, iris.target) > > > Traceback (most recent call last): > > > File "", line 1, in > > > File "sklearn1\tree\tree.py", line 790, in fit > > > X_idx_sorted=X_idx_sorted) > > > File "sklearn1\tree\tree.py", line 341, in fit > > > self.presort) > > > TypeError: Argument 'criterion' has incorrect type (expected sklearn.tree._criterion.Criterion, got sklearn.tree._criterion.Gini) > > > Then a weird error happened. Actually I also tried the newest stable version of scikit-learn earlier today. It had the same error. So I was thinking maybe try the newest version in github might help. Unlikely, it didn't. > > I have limited knowledge about the source code of scikit-learn. I am wondering if anyone could help me with this. > > Thanks! > > Best, > Hanna > > > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn _______________________________________________ scikit-learn mailing list scikit-learn at python.org https://mail.python.org/mailman/listinfo/scikit-learn
-------------- next part -------------- An HTML attachment was scrubbed... URL: From tevang3 at gmail.com Tue Sep 5 09:39:03 2017 From: tevang3 at gmail.com (Thomas Evangelidis) Date: Tue, 5 Sep 2017 15:39:03 +0200 Subject: [scikit-learn] combining datasets from different sources Message-ID: Greetings, I am working on a problem that involves predicting the binding affinity of small molecules on a receptor structure (it is a regression problem, not classification). I have multiple small datasets of molecules with measured binding affinities on a receptor, but each dataset was measured in different experimental conditions and therefore I cannot use them all together as a training set. So, instead of using them individually, I was wondering whether there is a method to combine them all into a super training set. The first way I could think of is to convert the binding affinities to Z-scores and then combine all the small datasets of molecules. But this would be inaccurate because, firstly, the datasets are very small (10-50 molecules each), and secondly, the range of binding affinities differs in each experiment (some datasets contain really strong binders, while others do not, etc.). Is there any other approach to combine datasets with values coming from different sources? Maybe if someone points me to the right reference I could read and understand if it is applicable to my case. Thanks, Thomas -- ====================================================================== Dr Thomas Evangelidis Post-doctoral Researcher CEITEC - Central European Institute of Technology Masaryk University Kamenice 5/A35/2S049, 62500 Brno, Czech Republic email: tevang at pharm.uoa.gr tevang3 at gmail.com website: https://sites.google.com/site/thomasevangelidishomepage/ -------------- next part -------------- An HTML attachment was scrubbed...
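A minimal sketch of the per-assay Z-scoring idea described above, with the caveat the author raises (small per-assay sample sizes make the mean/std estimates noisy). All affinity values below are invented for illustration.

```python
import numpy as np

# Illustrative binding affinities (kcal/mol) from two hypothetical assays.
assay_a = np.array([-7.1, -8.0, -6.5, -9.2])
assay_b = np.array([-10.4, -12.1, -11.0, -9.8])

def zscore(values):
    # Standardize within one assay: mean 0, unit sample standard deviation.
    return (values - values.mean()) / values.std(ddof=1)

# Pool the standardized targets into one training vector.
pooled = np.concatenate([zscore(assay_a), zscore(assay_b)])
# With only 10-50 molecules per assay, and assays covering different affinity
# ranges, these per-assay location/scale estimates are unreliable -- which is
# exactly the weakness pointed out in the message.
```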
URL: From jcrudy at gmail.com Tue Sep 5 13:02:57 2017 From: jcrudy at gmail.com (Jason Rudy) Date: Tue, 5 Sep 2017 10:02:57 -0700 Subject: [scikit-learn] combining datasets from different sources In-Reply-To: References: Message-ID: Thomas, This is sort of related to the problem I did my M.S. thesis on years ago: cross-platform normalization of gene expression data. If you google that term you'll find some papers. The situation is somewhat different, though, because with microarrays or RNA-seq you get thousands of data points for each experiment, which makes it easier to estimate the batch effect. The principle is similar, however. If I were in your situation, I would consider whether I have any of the following advantages: 1. Some molecules that appear in multiple data sets 2. Detailed information about the different experimental conditions 3. Physical/chemical models of how experimental conditions influence binding affinity If you have any of the above, you can potentially use them to improve your estimates. You could also consider using experiment ID as a categorical predictor in a sufficiently general regression method. Lastly, you may already know this, but the term "meta-analysis" is relevant here, and you can google for specific techniques. Most of these would be more limited than what you are envisioning, I think. Best, Jason On Tue, Sep 5, 2017 at 6:39 AM, Thomas Evangelidis wrote: > Greetings, > > I am working on a problem that involves predicting the binding affinity of > small molecules on a receptor structure (is regression problem, not > classification). I have multiple small datasets of molecules with measured > binding affinities on a receptor, but each dataset was measured in > different experimental conditions and therefore I cannot use them all > together as trainning set. So, instead of using them individually, I was > wondering whether there is a method to combine them all into a super > training set.
The first way I could think of is to convert the binding > affinities to Z-scores and then combine all the small datasets of > molecules. But this is would be inaccurate because, firstly the datasets > are very small (10-50 molecules each), and secondly, the range of binding > affinities differs in each experiment (some datasets contain really strong > binders, while others do not, etc.). Is there any other approach to combine > datasets with values coming from different sources? Maybe if someone points > me to the right reference I could read and understand if it is applicable > to my case. > > Thanks, > Thomas > > -- > > ====================================================================== > > Dr Thomas Evangelidis > > Post-doctoral Researcher > CEITEC - Central European Institute of Technology > Masaryk University > Kamenice 5/A35/2S049, > 62500 Brno, Czech Republic > > email: tevang at pharm.uoa.gr > > tevang3 at gmail.com > > > website: https://sites.google.com/site/thomasevangelidishomepage/ > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From se.raschka at gmail.com Tue Sep 5 13:35:45 2017 From: se.raschka at gmail.com (Sebastian Raschka) Date: Tue, 5 Sep 2017 13:35:45 -0400 Subject: [scikit-learn] combining datasets from different sources In-Reply-To: References: Message-ID: <96A42451-D5CB-4FB5-B7E3-A6368837D040@gmail.com> Another approach would be to pose this as a "ranking" problem to predict relative affinities rather than absolute affinities. E.g., if you have data from one (or more) molecules that has/have been tested under 2 or more experimental conditions, you can rank the other molecules accordingly or normalize. E.g. 
if you observe that the binding affinity of molecule A is -7 kcal/mol in assay A and -9 kcal/mol in assay B, and say the binding affinities of molecule B are -10 and -12 kcal/mol, respectively, that should give you some information for normalizing the values from assay B (e.g., by adding 2 kcal/mol). Of course this is not a perfect solution and might be error prone, but so are experimental assays ... (when I sometimes look at the std error/CI of the data I get from collaborators ... well, it seems that absolute binding affinities have always been taken with a grain of salt anyway) Best, Sebastian > On Sep 5, 2017, at 1:02 PM, Jason Rudy wrote: > > Thomas, > > This is sort of related to the problem I did my M.S. thesis on years ago: cross-platform normalization of gene expression data. If you google that term you'll find some papers. The situation is somewhat different, though, because with microarrays or RNA-seq you get thousands of data points for each experiment, which makes it easier to estimate the batch effect. The principle is the similar, however. > > If I were in your situation, I would consider whether I have any of the following advantages: > > 1. Some molecules that appear in multiple data sets > 2. Detailed information about the different experimental conditions > 3. Physical/chemical models of how experimental conditions influence binding affinity > > If you have any of the above, you can potentially use them to improve your estimates. You could also consider using experiment ID as a categorical predictor in a sufficiently general regression method. > > Lastly, you may already know this, but the term "meta-analysis" is relevant here, and you can google for specific techniques. Most of these would be more limited than what you are envisioning, I think.
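Jason's suggestion of using the experiment ID as a categorical predictor could look roughly like this. All data below are invented; the point is only the mechanics of one-hot encoding the assay label so a flexible regressor can absorb the per-assay batch effect.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import OneHotEncoder

rng = np.random.RandomState(42)

# Hypothetical molecular descriptors and affinities from three assays.
X_desc = rng.rand(60, 4)                             # made-up descriptors
assay_id = np.repeat([0, 1, 2], 20).reshape(-1, 1)   # which assay each row came from
# Simulated assay-dependent shift in the measured affinity:
y = X_desc[:, 0] - 0.5 * assay_id.ravel() + 0.1 * rng.randn(60)

# One-hot encode the assay label and append it to the descriptors.
onehot = OneHotEncoder().fit_transform(assay_id).toarray()   # (60, 3)
X = np.hstack([X_desc, onehot])                              # (60, 7)

# A sufficiently general regressor can then model the batch effect directly.
model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
```

At prediction time one must still decide which assay's column to switch on, which is the sense in which this models relative rather than absolute affinities.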
> > Best, > > Jason > > On Tue, Sep 5, 2017 at 6:39 AM, Thomas Evangelidis wrote: > Greetings, > > I am working on a problem that involves predicting the binding affinity of small molecules on a receptor structure (is regression problem, not classification). I have multiple small datasets of molecules with measured binding affinities on a receptor, but each dataset was measured in different experimental conditions and therefore I cannot use them all together as trainning set. So, instead of using them individually, I was wondering whether there is a method to combine them all into a super training set. The first way I could think of is to convert the binding affinities to Z-scores and then combine all the small datasets of molecules. But this is would be inaccurate because, firstly the datasets are very small (10-50 molecules each), and secondly, the range of binding affinities differs in each experiment (some datasets contain really strong binders, while others do not, etc.). Is there any other approach to combine datasets with values coming from different sources? Maybe if someone points me to the right reference I could read and understand if it is applicable to my case. 
> > Thanks, > Thomas > > -- > ====================================================================== > Dr Thomas Evangelidis > Post-doctoral Researcher > CEITEC - Central European Institute of Technology > Masaryk University > Kamenice 5/A35/2S049, > 62500 Brno, Czech Republic > > email: tevang at pharm.uoa.gr > tevang3 at gmail.com > > website: https://sites.google.com/site/thomasevangelidishomepage/ > > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn From maciek at wojcikowski.pl Tue Sep 5 14:33:39 2017 From: maciek at wojcikowski.pl (=?UTF-8?Q?Maciek_W=C3=B3jcikowski?=) Date: Tue, 5 Sep 2017 20:33:39 +0200 Subject: [scikit-learn] combining datasets from different sources In-Reply-To: <96A42451-D5CB-4FB5-B7E3-A6368837D040@gmail.com> References: <96A42451-D5CB-4FB5-B7E3-A6368837D040@gmail.com> Message-ID: Hi Thomas and others, It also really depends on how many data points you have for each compound. If you have more than a few, then there are a few options. If you get two very distinct activities for one ligand, I'd discard such samples as ambiguous, or decide on one of the assays/experiments (the one with the lower error). The exact problem was faced by the PDBbind creators; I'd also look there for details of what they did with their activities. To follow up on Sebastian's suggestion: have you checked how different the ranks/Z-scores you get are? Check out the Kendall tau. Anyhow, you could build local models for specific experimental methods.
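The rank-agreement check suggested here could be sketched with SciPy's Kendall tau. The overlapping measurements below are made up; in this example assay B is a monotone shift of assay A, so the ranking agrees perfectly.

```python
from scipy.stats import kendalltau

# Hypothetical compounds measured in both assays (invented values, kcal/mol).
assay_a = [-7.0, -8.2, -6.1, -9.5, -7.8]
assay_b = [-9.1, -10.0, -8.0, -11.6, -9.9]

tau, p_value = kendalltau(assay_a, assay_b)
# tau near 1: the assays rank the compounds consistently, so a rank-based
# (or shift-based) combination is plausible; tau near 0: they disagree and
# pooling the assays is likely to mislead the model.
```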
In our recent publication in a slightly different area (protein-ligand scoring functions), we show that an RF built on one target is just slightly better than an RF built on many targets (we used the DUD-E database); check out the "horizontal" and "per-target" splits in https://www.nature.com/articles/srep46710. Unfortunately, this may change for different models, plus the molecular descriptors used, which we know nothing about. I hope that helped a bit. ---- Pozdrawiam, | Best regards, Maciek Wójcikowski maciek at wojcikowski.pl 2017-09-05 19:35 GMT+02:00 Sebastian Raschka : > Another approach would be to pose this as a "ranking" problem to predict > relative affinities rather than absolute affinities. E.g., if you have data > from one (or more) molecules that has/have been tested under 2 or more > experimental conditions, you can rank the other molecules accordingly or > normalize. E.g. if you observe that the binding affinity of molecule a is > -7 kcal/mol in assay A and -9 kcal/mol in assay to, and say the binding > affinities of molecule B are -10 and -12 kcal/mol, respectively, that > should give you some information for normalizing the values from assay 2 > (e.g., by adding 2 kcal/mol). Of course this is not a perfect solution and > might be error prone, but so are experimental assays ... (when I sometimes > look at the std error/CI of the data I get from collaborators ... well, it > seems that absolute binding affinities have always taken with a grain of > salt anyway) > > Best, > Sebastian > > > On Sep 5, 2017, at 1:02 PM, Jason Rudy wrote: > > > > Thomas, > > > > This is sort of related to the problem I did my M.S. thesis on years > ago: cross-platform normalization of gene expression data. If you google > that term you'll find some papers. The situation is somewhat different, > though, because with microarrays or RNA-seq you get thousands of data > points for each experiment, which makes it easier to estimate the batch > effect.
The principle is the similar, however. > > > > If I were in your situation, I would consider whether I have any of the > following advantages: > > > > 1. Some molecules that appear in multiple data sets > > 2. Detailed information about the different experimental conditions > > 3. Physical/chemical models of how experimental conditions influence > binding affinity > > > > If you have any of the above, you can potentially use them to improve > your estimates. You could also consider using experiment ID as a > categorical predictor in a sufficiently general regression method. > > > > Lastly, you may already know this, but the term "meta-analysis" is > relevant here, and you can google for specific techniques. Most of these > would be more limited than what you are envisioning, I think. > > > > Best, > > > > Jason > > > > On Tue, Sep 5, 2017 at 6:39 AM, Thomas Evangelidis > wrote: > > Greetings, > > > > I am working on a problem that involves predicting the binding affinity > of small molecules on a receptor structure (is regression problem, not > classification). I have multiple small datasets of molecules with measured > binding affinities on a receptor, but each dataset was measured in > different experimental conditions and therefore I cannot use them all > together as trainning set. So, instead of using them individually, I was > wondering whether there is a method to combine them all into a super > training set. The first way I could think of is to convert the binding > affinities to Z-scores and then combine all the small datasets of > molecules. But this is would be inaccurate because, firstly the datasets > are very small (10-50 molecules each), and secondly, the range of binding > affinities differs in each experiment (some datasets contain really strong > binders, while others do not, etc.). Is there any other approach to combine > datasets with values coming from different sources? 
Maybe if someone points me to the right reference I could read and understand if it > is applicable to my case. > > > > Thanks, > > Thomas > > > > -- > > ====================================================================== > > Dr Thomas Evangelidis > > Post-doctoral Researcher > > CEITEC - Central European Institute of Technology > > Masaryk University > > Kamenice 5/A35/2S049, > > 62500 Brno, Czech Republic > > > > email: tevang at pharm.uoa.gr > > tevang3 at gmail.com > > > > website: https://sites.google.com/site/thomasevangelidishomepage/ > > > > > > > > _______________________________________________ > > scikit-learn mailing list > > scikit-learn at python.org > > https://mail.python.org/mailman/listinfo/scikit-learn > > > > > > _______________________________________________ > > scikit-learn mailing list > > scikit-learn at python.org > > https://mail.python.org/mailman/listinfo/scikit-learn > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From tevang3 at gmail.com Tue Sep 5 18:29:45 2017 From: tevang3 at gmail.com (Thomas Evangelidis) Date: Wed, 6 Sep 2017 00:29:45 +0200 Subject: [scikit-learn] combining datasets from different sources In-Reply-To: References: <96A42451-D5CB-4FB5-B7E3-A6368837D040@gmail.com> Message-ID: Thanks Jason, Sebastian and Maciek! I believe from all the suggestions, the most feasible solution is to look for experimental assays which overlap by at least two compounds, and then adjust the binding affinities of one of them by looking at their difference in both assays. Sebastian mentioned the simplest scenario, where the shift for both compounds is 2 kcal/mol. However, he neglected to mention that the ratio between the affinities of the two compounds in each assay also matters.
Specifically, the ratio Ka/Kb=-7/-9=0.78 in assay A but -10/-12=0.83 in assay B. Ideally that should also be taken into account to select the right transformation function for the values from assay B. Is anybody aware of a clever algorithm to select the right transformation function for such a case? I am sure one exists. The other approach would be to train different predictors from each assay and then apply a data fusion technique (e.g. min rank). But that wouldn't be that elegant. @Maciek To my understanding, the paper you cited addresses a classification problem (actives, inactives) by implementing Random Forest Classifiers. My case is a regression problem. Best, Thomas On 5 September 2017 at 20:33, Maciek Wójcikowski wrote: > Hi Thomas and others, > > It also really depend on how many data points you have on each compound. > If you had more than a few then there are few options. If you get two very > distinct activities for one ligand. I'd discard such samples as ambiguous > or decide on one of the assays/experiments (the one with lower error). The > exact problem was faced by PDBbind creators, I'd also look there for > details what they did with their activities. > > To follow up Sebastians suggestion: have you checked how different > ranks/Z-scores you get? Check out the Kendall Tau. > > Anyhow, you could build local models for a specific experimental methods. > In our recent publication on slightly different area (protein-ligand > scoring function), we show that the RF build on one target is just slightly > better than the RF build on many targets (we've used DUD-E database); > Checkout the "horizontal" and "per-target" splits https://www.nature.com/ > articles/srep46710. Unfortunately, this may change for different models. > Plus the molecular descriptors used, which we know nothing about. > > I hope that helped a bit.
> > ---- > Pozdrawiam, | Best regards, > Maciek W?jcikowski > maciek at wojcikowski.pl > > 2017-09-05 19:35 GMT+02:00 Sebastian Raschka : > >> Another approach would be to pose this as a "ranking" problem to predict >> relative affinities rather than absolute affinities. E.g., if you have data >> from one (or more) molecules that has/have been tested under 2 or more >> experimental conditions, you can rank the other molecules accordingly or >> normalize. E.g. if you observe that the binding affinity of molecule a is >> -7 kcal/mol in assay A and -9 kcal/mol in assay to, and say the binding >> affinities of molecule B are -10 and -12 kcal/mol, respectively, that >> should give you some information for normalizing the values from assay 2 >> (e.g., by adding 2 kcal/mol). Of course this is not a perfect solution and >> might be error prone, but so are experimental assays ... (when I sometimes >> look at the std error/CI of the data I get from collaborators ... well, it >> seems that absolute binding affinities have always taken with a grain of >> salt anyway) >> >> Best, >> Sebastian >> >> > On Sep 5, 2017, at 1:02 PM, Jason Rudy wrote: >> > >> > Thomas, >> > >> > This is sort of related to the problem I did my M.S. thesis on years >> ago: cross-platform normalization of gene expression data. If you google >> that term you'll find some papers. The situation is somewhat different, >> though, because with microarrays or RNA-seq you get thousands of data >> points for each experiment, which makes it easier to estimate the batch >> effect. The principle is the similar, however. >> > >> > If I were in your situation, I would consider whether I have any of the >> following advantages: >> > >> > 1. Some molecules that appear in multiple data sets >> > 2. Detailed information about the different experimental conditions >> > 3. 
Physical/chemical models of how experimental conditions influence >> binding affinity >> > >> > If you have any of the above, you can potentially use them to improve >> your estimates. You could also consider using experiment ID as a >> categorical predictor in a sufficiently general regression method. >> > >> > Lastly, you may already know this, but the term "meta-analysis" is >> relevant here, and you can google for specific techniques. Most of these >> would be more limited than what you are envisioning, I think. >> > >> > Best, >> > >> > Jason >> > >> > On Tue, Sep 5, 2017 at 6:39 AM, Thomas Evangelidis >> wrote: >> > Greetings, >> > >> > I am working on a problem that involves predicting the binding affinity >> of small molecules on a receptor structure (is regression problem, not >> classification). I have multiple small datasets of molecules with measured >> binding affinities on a receptor, but each dataset was measured in >> different experimental conditions and therefore I cannot use them all >> together as trainning set. So, instead of using them individually, I was >> wondering whether there is a method to combine them all into a super >> training set. The first way I could think of is to convert the binding >> affinities to Z-scores and then combine all the small datasets of >> molecules. But this is would be inaccurate because, firstly the datasets >> are very small (10-50 molecules each), and secondly, the range of binding >> affinities differs in each experiment (some datasets contain really strong >> binders, while others do not, etc.). Is there any other approach to combine >> datasets with values coming from different sources? Maybe if som >> eone points me to the right reference I could read and understand if it >> is applicable to my case. 
>> > >> > Thanks, >> > Thomas >> > >> > -- >> > ====================================================================== >> > Dr Thomas Evangelidis >> > Post-doctoral Researcher >> > CEITEC - Central European Institute of Technology >> > Masaryk University >> > Kamenice 5/A35/2S049, >> > 62500 Brno, Czech Republic >> > >> > email: tevang at pharm.uoa.gr >> > tevang3 at gmail.com >> > >> > website: https://sites.google.com/site/thomasevangelidishomepage/ >> > >> > _______________________________________________ >> > scikit-learn mailing list >> > scikit-learn at python.org >> > https://mail.python.org/mailman/listinfo/scikit-learn -- ====================================================================== Dr Thomas Evangelidis Post-doctoral Researcher CEITEC - Central European Institute of Technology Masaryk University Kamenice 5/A35/2S049, 62500 Brno, Czech Republic email: tevang at pharm.uoa.gr tevang3 at gmail.com website: https://sites.google.com/site/thomasevangelidishomepage/
From qinhanmin2005 at sina.com Wed Sep 6 08:51:24 2017 From: qinhanmin2005 at sina.com (qinhanmin2005 at sina.com) Date: Wed, 06 Sep 2017 20:51:24 +0800 Subject: [scikit-learn] Need suggestions on the example about discretization Message-ID: <20170906125124.31FC410200EA@webmail.sinamail.sina.com.cn>

Scikit-learn recently added support for discretization (KBinsDiscretizer, see the discrete branch) and we need an example to illustrate its usage. I have proposed a draft in https://github.com/scikit-learn/scikit-learn/issues/9339:
(1) use the iris dataset (only use two features)
(2) plot the data before and after discretization
(3) train a classifier using the data before and after discretization and compare the result from cross validation
Since I'm not an expert of machine learning, I'm wondering whether it is a good example. Could someone from the community provide some ideas or suggestions? Thanks a lot.

From alefevre at ykems.com Wed Sep 6 11:56:32 2017 From: alefevre at ykems.com (Lefevre, Augustin) Date: Wed, 6 Sep 2017 15:56:32 +0000 Subject: [scikit-learn] using a predictor as transformer Message-ID:

Hi all,

I am playing with the pipeline features of sklearn and it seems that I can't use a prediction algorithm as an intermediate step. For instance, in the example below I use the output of a lasso as an additional feature to feed a random forest, in such a way that the Lasso does some preliminary feature selection. But I get a "TypeError: All estimators should implement fit and transform." So I would like to add a transform method to the Lasso estimator so that it can be used in a FeatureUnion. Is that possible ?
Best regards

X = np.hstack((np.random.randn(500, 10), np.random.randint(0, 10, (500, 10))))  # regressor variables
y = np.random.randn(500)  # target variable
ct_get = FunctionTransformer(lambda d: d[:, 0:10])   # transformer to extract the continuous columns (note: d[0:10] would slice rows, not columns)
dt_get = FunctionTransformer(lambda d: d[:, 10:20])  # transformer to extract the discrete columns

# first step is a regression pipeline
reg = Pipeline([('ct_vars', ct_get), ('scaler', StandardScaler()), ('poly', PolynomialFeatures(degree=3)), ('lasso', Lasso())])
# A random forest feeds on the discrete part of the data + one continuous variable
estimator = Pipeline([('level1', FeatureUnion([('dt_vars', dt_get), ('reg', reg)])), ('rf', RandomForestRegressor())])

estimator.fit(X, y)
print "R^2 score is :", estimator.score(X, y)

Augustin LEFEVRE | Consultant Senior | Ykems | - M : +33 7 77 97 94 89 | alefevre at ykems.com | www.ykems.com

Save a tree ! Think before you print

SECURE BUSINESS This message and its attachment contain information that may be privileged or confidential and is the property of Beijaflore. It is intended only for the person to whom it is addressed. If you are not the intended recipient, you are not authorized to read, print, retain, copy, disseminate, distribute, use or rely on the information contained in this email. If you receive this message in error, please notify the sender immediately and delete all copies of this message.
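One possible way to get such a transform method is to wrap the estimator in a small custom class (a sketch; this `PredictionTransformer` is hypothetical and not part of scikit-learn):

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import Lasso


class PredictionTransformer(BaseEstimator, TransformerMixin):
    """Hypothetical wrapper: exposes a regressor's predictions as a
    single transformed feature, so it can sit inside a FeatureUnion."""

    def __init__(self, estimator):
        self.estimator = estimator

    def fit(self, X, y=None):
        self.estimator.fit(X, y)
        return self

    def transform(self, X):
        # one output column: the wrapped model's predictions
        return self.estimator.predict(X).reshape(-1, 1)
```

With something like this, the FeatureUnion step above could become `('reg', PredictionTransformer(reg))`, where `reg` is the regression pipeline.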
From t3kcit at gmail.com Wed Sep 6 13:49:31 2017 From: t3kcit at gmail.com (Andreas Mueller) Date: Wed, 6 Sep 2017 13:49:31 -0400 Subject: [scikit-learn] using a predictor as transformer In-Reply-To: References: Message-ID: <1404e3cf-94e6-eea3-c41c-144e8b752504@gmail.com>

If you want to use lasso for feature selection in a pipeline you have to wrap it in SelectFromModel.

On 09/06/2017 11:56 AM, Lefevre, Augustin wrote: > > Hi all, > > I am playing with the pipeline features of sklearn and it seems that I > can't use a prediction algorithm as an intermediate step. > > For instance, in the example below I use the output of a lasso as an > additional feature to feed a random forest, in such a way that the Lasso > does some preliminary feature selection. > > But I get a "TypeError: All estimators should implement fit and > transform." > > So I would like to add a transform method to the Lasso estimator so > that it can be used in a FeatureUnion. Is that possible ?
> > Best regards > > [...] > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn
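Andreas's suggestion could look like the following sketch (made-up data and parameter values, not from the thread): `SelectFromModel` gives the fitted Lasso a `transform` method that keeps only the columns with non-negligible coefficients.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso
from sklearn.pipeline import Pipeline

# Lasso wrapped in SelectFromModel acts as a transformer: transform(X)
# drops the features whose Lasso coefficients fall below the threshold.
pipe = Pipeline([
    ('select', SelectFromModel(Lasso(alpha=0.1))),
    ('rf', RandomForestRegressor(n_estimators=50, random_state=0)),
])

# made-up data where only the first feature drives the target
rng = np.random.RandomState(0)
X = rng.randn(200, 20)
y = 2.0 * X[:, 0] + 0.1 * rng.randn(200)

pipe.fit(X, y)
n_kept = pipe.named_steps['select'].get_support().sum()
```

Note this does feature *selection*, not what the original message asked for (feeding the Lasso's predictions to the forest as an extra feature); those are different pipelines.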
From g.lemaitre58 at gmail.com Wed Sep 6 12:23:23 2017 From: g.lemaitre58 at gmail.com (Guillaume Lemaitre) Date: Wed, 06 Sep 2017 18:23:23 +0200 Subject: [scikit-learn] using a predictor as transformer In-Reply-To: References: Message-ID: <20170906162323.4870223.94839.39244@gmail.com> An HTML attachment was scrubbed... URL:

From tevang3 at gmail.com Wed Sep 6 14:48:26 2017 From: tevang3 at gmail.com (Thomas Evangelidis) Date: Wed, 6 Sep 2017 20:48:26 +0200 Subject: [scikit-learn] combining datasets from different sources In-Reply-To: References: <96A42451-D5CB-4FB5-B7E3-A6368837D040@gmail.com> Message-ID:

After some thought about this problem today, I think it is an objective function minimization problem, where the objective function can be the root mean square deviation (RMSD) between the affinities of the common molecules in the two data sets. I could work iteratively, first rescale and fit assay B to match A, then proceed to assay C and so forth. Or alternatively, for each assay I need to find two missing variables, the optimum shift Sh and the scale Sc. So if I have 3 assays A, B, C, let's say, I am looking for the optimum values of Sh_A, Sc_A, Sh_B, Sc_B, Sh_C, Sc_C that minimize the RMSD between the binding affinities of the overlapping molecules. Any idea how I can do that with scikit-learn?
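For two assays with overlapping molecules, the shift-and-scale RMSD minimization described above could be sketched with scipy.optimize (the affinity numbers below are made up for illustration; Sc_B and Sh_B are the scale and shift mapping assay B onto assay A):

```python
import numpy as np
from scipy.optimize import minimize

# made-up binding affinities (kcal/mol) of four molecules measured in both assays
assay_A = np.array([-7.0, -9.0, -6.5, -8.2])
assay_B = np.array([-10.0, -12.0, -9.4, -11.1])  # same molecules, other assay


def rmsd(params):
    # RMSD between assay B rescaled onto assay A, and assay A itself
    Sc_B, Sh_B = params
    return np.sqrt(np.mean((Sc_B * assay_B + Sh_B - assay_A) ** 2))


# start from the identity transformation (scale 1, shift 0)
res = minimize(rmsd, x0=[1.0, 0.0])
Sc_B, Sh_B = res.x
```

For three or more assays one would concatenate the per-assay parameters into a single vector and sum the RMSD terms over all overlapping pairs; with only two data points per pair, though, the fit is obviously fragile.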
On 6 September 2017 at 00:29, Thomas Evangelidis wrote: > Thanks Jason, Sebastian and Maciek! > > I believe from all the suggestions, the most feasible solution is to look > at experimental assays which overlap by at least two compounds, and then > adjust the binding affinities of one of them by looking at their difference > in both assays. Sebastian mentioned the simplest scenario, where the shift > for both compounds is 2 kcal/mol. However, he neglected to mention that the > ratio between the affinities of the two compounds in each assay also > matters. Specifically, the ratio Ka/Kb=-7/-9=0.78 in assay A but > -10/-12=0.83 in assay B. Ideally that should also be taken into account to > select the right transformation function for the values from assay B. Is > anybody aware of any clever algorithm to select the right transformation > function for such a case? I am sure one exists. > > The other approach would be to train different predictors from each assay > and then apply a data fusion technique (e.g. min rank). But that wouldn't > be that elegant. > > @Maciek To my understanding, the paper you cited addresses a > classification problem (actives, inactives) by implementing Random Forest > Classifiers. My case is a regression problem. > > > best, > Thomas > > > On 5 September 2017 at 20:33, Maciek Wójcikowski > wrote: > >> Hi Thomas and others, >> >> It also really depends on how many data points you have on each compound. >> If you had more than a few then there are a few options. If you get two very >> distinct activities for one ligand, I'd discard such samples as ambiguous >> or decide on one of the assays/experiments (the one with lower error). The >> exact problem was faced by the PDBbind creators; I'd also look there for >> details on what they did with their activities. >> >> To follow up Sebastian's suggestion: have you checked how different >> ranks/Z-scores you get? Check out the Kendall Tau.
>> >> Anyhow, you could build local models for specific experimental methods. >> In our recent publication on a slightly different area (protein-ligand >> scoring function), we show that the RF built on one target is just slightly >> better than the RF built on many targets (we've used the DUD-E database); >> check out the "horizontal" and "per-target" splits: >> https://www.nature.com/articles/srep46710. Unfortunately, this may >> change for different models. Plus the molecular descriptors used, which we >> know nothing about. >> >> I hope that helped a bit. >> >> ---- >> Pozdrawiam, | Best regards, >> Maciek Wójcikowski >> maciek at wojcikowski.pl >> >> 2017-09-05 19:35 GMT+02:00 Sebastian Raschka : >> >>> [...] >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> > > > -- > ====================================================================== > Dr Thomas Evangelidis > Post-doctoral Researcher > CEITEC - Central European Institute of Technology > Masaryk University > Kamenice 5/A35/2S049, > 62500 Brno, Czech Republic > > email: tevang at pharm.uoa.gr > >
tevang3 at gmail.com > > website: https://sites.google.com/site/thomasevangelidishomepage/ > > -- ====================================================================== Dr Thomas Evangelidis Post-doctoral Researcher CEITEC - Central European Institute of Technology Masaryk University Kamenice 5/A35/2S049, 62500 Brno, Czech Republic email: tevang at pharm.uoa.gr tevang3 at gmail.com website: https://sites.google.com/site/thomasevangelidishomepage/

From maciek at wojcikowski.pl Thu Sep 7 09:29:26 2017 From: maciek at wojcikowski.pl (Maciek Wójcikowski) Date: Thu, 7 Sep 2017 15:29:26 +0200 Subject: [scikit-learn] combining datasets from different sources In-Reply-To: References: <96A42451-D5CB-4FB5-B7E3-A6368837D040@gmail.com> Message-ID:

I think StandardScaler is what you want. For each assay you will get a mean and a variance. The average mean would be the "optimal" shift and the average variance the spread. But would this value make any physical sense?

Considering RF-Score-VS: in fact it is a regressor and it predicts a real value, not a class. Although it is validated mostly using the Enrichment Factor, the last figure shows top results for regression vs Vina.

---- Pozdrawiam, | Best regards, Maciek Wójcikowski maciek at wojcikowski.pl

2017-09-06 20:48 GMT+02:00 Thomas Evangelidis : > [...] > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn

From tevang3 at gmail.com Thu Sep 7 09:57:01 2017 From: tevang3 at gmail.com (Thomas Evangelidis) Date: Thu, 7 Sep 2017 15:57:01 +0200 Subject: [scikit-learn] combining datasets from different sources In-Reply-To: References: <96A42451-D5CB-4FB5-B7E3-A6368837D040@gmail.com> Message-ID:

On 7 September 2017 at 15:29, Maciek Wójcikowski wrote: > I think StandardScaler is what you want. For each assay you will get a mean > and a variance. The average mean would be the "optimal" shift and the average > variance the spread. But would this value make any physical sense? > >

I think you missed my point. The problem was scaling with restraints: the RMSD between the binding affinities of the common ligands must be minimized upon scaling. Anyway, I managed to work it out using scipy.optimize.

> Considering RF-Score-VS: in fact it is a regressor and it predicts a > real value, not a class. Although it is validated mostly using the Enrichment > Factor, the last figure shows top results for regression vs Vina. > >

To my understanding, you trained the RF using class information (active, inactive) and the prediction was a probability value. If the probability is above 0.5 then the compound is an active, otherwise it is an inactive. This is how sklearn.ensemble.RandomForestClassifier works. In contrast, I train MLPRegressors using binding affinities (scalar values) and the predictions are binding affinities (scalar values).

> ---- > Pozdrawiam, | Best regards, > Maciek Wójcikowski > maciek at wojcikowski.pl > > 2017-09-06 20:48 GMT+02:00 Thomas Evangelidis : > >> [...] >> On 6 September 2017 at 00:29, Thomas Evangelidis >> wrote: >>> Thanks Jason, Sebastian and Maciek!
>>> >>> I believe from all the suggestions, the most feasible solutions is to >>> look experimental assays which overlap by at least two compounds, and then >>> adjust the binding affinities of one of them by looking in their difference >>> in both assays. Sebastian mentioned the simplest scenario, where the shift >>> for both compounds is 2 kcal/mol. However, he neglected to mention that the >>> ratio between the affinities of the two compounds in each assay also >>> matters. Specifically, the ratio Ka/Kb=-7/-9=0.78 in assay A but >>> -10/-12=0.83 in assay B. Ideally that should also be taken into account to >>> select the right transformation function for the values from Assay B. Is >>> anybody away of any clever algorithm to select the right transformation >>> function for such a case? I am sure there exists. >>> >>> The other approach would be to train different predictors from each >>> assay and then apply a data fusion technique (e.g. min rank). But that >>> wouldn't be that elegant. >>> >>> @Maciek To my understanding, the paper you cited addresses a >>> classification problem (actives, inactives) by implementing Random Forrest >>> Classfiers. My case is a Regression problem. >>> >>> >>> best, >>> Thomas >>> >>> >>> On 5 September 2017 at 20:33, Maciek W?jcikowski >>> wrote: >>> >>>> Hi Thomas and others, >>>> >>>> It also really depend on how many data points you have on each >>>> compound. If you had more than a few then there are few options. If you get >>>> two very distinct activities for one ligand. I'd discard such samples as >>>> ambiguous or decide on one of the assays/experiments (the one with lower >>>> error). The exact problem was faced by PDBbind creators, I'd also look >>>> there for details what they did with their activities. >>>> >>>> To follow up Sebastians suggestion: have you checked how different >>>> ranks/Z-scores you get? Check out the Kendall Tau. 
>>>> >>>> Anyhow, you could build local models for specific experimental >>>> methods. In our recent publication in a slightly different area >>>> (protein-ligand scoring functions), we show that the RF built on one target >>>> is just slightly better than the RF built on many targets (we've used the DUD-E >>>> database); check out the "horizontal" and "per-target" splits in >>>> https://www.nature.com/articles/srep46710. Unfortunately, this may >>>> change for different models. Plus the molecular descriptors used, which we >>>> know nothing about. >>>> >>>> I hope that helped a bit. >>>> >>>> ---- >>>> Pozdrawiam, | Best regards, >>>> Maciek Wójcikowski >>>> maciek at wojcikowski.pl >>>> >>>> 2017-09-05 19:35 GMT+02:00 Sebastian Raschka : >>>> >>>>> Another approach would be to pose this as a "ranking" problem to >>>>> predict relative affinities rather than absolute affinities. E.g., if you >>>>> have data from one (or more) molecules that has/have been tested under 2 or >>>>> more experimental conditions, you can rank the other molecules accordingly >>>>> or normalize. E.g. if you observe that the binding affinity of molecule A >>>>> is -7 kcal/mol in assay A and -9 kcal/mol in assay B, and say the binding >>>>> affinities of molecule B are -10 and -12 kcal/mol, respectively, that >>>>> should give you some information for normalizing the values from assay B >>>>> (e.g., by adding 2 kcal/mol). Of course this is not a perfect solution and >>>>> might be error prone, but so are experimental assays ... (when I sometimes >>>>> look at the std error/CI of the data I get from collaborators ... well, it >>>>> seems that absolute binding affinities have always been taken with a grain of >>>>> salt anyway) >>>>> >>>>> Best, >>>>> Sebastian >>>>> >>>>> > On Sep 5, 2017, at 1:02 PM, Jason Rudy wrote: >>>>> > >>>>> > Thomas, >>>>> > >>>>> > This is sort of related to the problem I did my M.S. thesis on years >>>>> ago: cross-platform normalization of gene expression data.
If you google >>>>> that term you'll find some papers. The situation is somewhat different, >>>>> though, because with microarrays or RNA-seq you get thousands of data >>>>> points for each experiment, which makes it easier to estimate the batch >>>>> effect. The principle is similar, however. >>>>> > >>>>> > If I were in your situation, I would consider whether I have any of >>>>> the following advantages: >>>>> > >>>>> > 1. Some molecules that appear in multiple data sets >>>>> > 2. Detailed information about the different experimental conditions >>>>> > 3. Physical/chemical models of how experimental conditions influence >>>>> binding affinity >>>>> > >>>>> > If you have any of the above, you can potentially use them to >>>>> improve your estimates. You could also consider using experiment ID as a >>>>> categorical predictor in a sufficiently general regression method. >>>>> > >>>>> > Lastly, you may already know this, but the term "meta-analysis" is >>>>> relevant here, and you can google for specific techniques. Most of these >>>>> would be more limited than what you are envisioning, I think. >>>>> > >>>>> > Best, >>>>> > >>>>> > Jason >>>>> > >>>>> > On Tue, Sep 5, 2017 at 6:39 AM, Thomas Evangelidis < >>>>> tevang3 at gmail.com> wrote: >>>>> > Greetings, >>>>> > >>>>> > I am working on a problem that involves predicting the binding >>>>> affinity of small molecules on a receptor structure (a regression problem, >>>>> not classification). I have multiple small datasets of molecules with >>>>> measured binding affinities on a receptor, but each dataset was measured in >>>>> different experimental conditions and therefore I cannot use them all >>>>> together as a training set. So, instead of using them individually, I was >>>>> wondering whether there is a method to combine them all into a super >>>>> training set.
The first way I could think of is to convert the binding >>>>> affinities to Z-scores and then combine all the small datasets of >>>>> molecules. But this would be inaccurate because, firstly, the datasets >>>>> are very small (10-50 molecules each), and secondly, the range of binding >>>>> affinities differs in each experiment (some datasets contain really strong >>>>> binders, while others do not, etc.). Is there any other approach to combine >>>>> datasets with values coming from different sources? Maybe if someone >>>>> points me to the right reference I could read and understand if >>>>> it is applicable to my case. >>>>> > >>>>> > Thanks, >>>>> > Thomas >>>>> > >>>>> > -- >>>>> > ============================================================ >>>>> ========== >>>>> > Dr Thomas Evangelidis >>>>> > Post-doctoral Researcher >>>>> > CEITEC - Central European Institute of Technology >>>>> > Masaryk University >>>>> > Kamenice 5/A35/2S049, >>>>> > 62500 Brno, Czech Republic >>>>> > >>>>> > email: tevang at pharm.uoa.gr >>>>> > tevang3 at gmail.com >>>>> > >>>>> > website: https://sites.google.com/site/thomasevangelidishomepage/ >>>>> > >>>>> > >>>>> > >>>>> > _______________________________________________ >>>>> > scikit-learn mailing list >>>>> > scikit-learn at python.org >>>>> > https://mail.python.org/mailman/listinfo/scikit-learn >>>>> > >>>>> > >>>>> > _______________________________________________ >>>>> > scikit-learn mailing list >>>>> > scikit-learn at python.org >>>>> > https://mail.python.org/mailman/listinfo/scikit-learn >>>>> >>>>> _______________________________________________ >>>>> scikit-learn mailing list >>>>> scikit-learn at python.org >>>>> https://mail.python.org/mailman/listinfo/scikit-learn >>>>> >>>> >>>> >>>> _______________________________________________ >>>> scikit-learn mailing list >>>> scikit-learn at python.org >>>> https://mail.python.org/mailman/listinfo/scikit-learn >>>> >>>> >>> >>> >>> -- >>> >>>
====================================================================== >>> >>> Dr Thomas Evangelidis >>> >>> Post-doctoral Researcher >>> CEITEC - Central European Institute of Technology >>> Masaryk University >>> Kamenice 5/A35/2S049, >>> 62500 Brno, Czech Republic >>> >>> email: tevang at pharm.uoa.gr >>> >>> tevang3 at gmail.com >>> >>> >>> website: https://sites.google.com/site/thomasevangelidishomepage/ >>> >>> >> >> >> -- >> >> ====================================================================== >> >> Dr Thomas Evangelidis >> >> Post-doctoral Researcher >> CEITEC - Central European Institute of Technology >> Masaryk University >> Kamenice 5/A35/2S049, >> 62500 Brno, Czech Republic >> >> email: tevang at pharm.uoa.gr >> >> tevang3 at gmail.com >> >> >> website: https://sites.google.com/site/thomasevangelidishomepage/ >> >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -- ====================================================================== Dr Thomas Evangelidis Post-doctoral Researcher CEITEC - Central European Institute of Technology Masaryk University Kamenice 5/A35/2S049, 62500 Brno, Czech Republic email: tevang at pharm.uoa.gr tevang3 at gmail.com website: https://sites.google.com/site/thomasevangelidishomepage/ -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From maciek at wojcikowski.pl Thu Sep 7 10:14:51 2017 From: maciek at wojcikowski.pl (=?UTF-8?Q?Maciek_W=C3=B3jcikowski?=) Date: Thu, 7 Sep 2017 16:14:51 +0200 Subject: [scikit-learn] combining datasets from different sources In-Reply-To: References: <96A42451-D5CB-4FB5-B7E3-A6368837D040@gmail.com> Message-ID: 2017-09-07 15:57 GMT+02:00 Thomas Evangelidis : > > > On 7 September 2017 at 15:29, Maciek Wójcikowski > wrote: > >> I think StandardScaler is what you want. For each assay you will get >> the mean and variance. The average mean would be the "optimal" shift and the average >> variance the spread. But would this value make any physical sense? >> >> I think you missed my point. The problem was scaling with restraints: > the RMSD between the binding affinities of the common ligands must be > minimized upon scaling. Anyway, I managed to work it out using > scipy.optimize. > Yes, I meant the common ligand, which would probably lead you to a similar solution. Out of curiosity: is there a connection between the optimal shift and the type of assay in your case? > > > >> Considering the RF-Score-VS: In fact it's a regressor and it predicts a >> real value, not a class. Although it is validated mostly using Enrichment >> Factor, the last figure shows top results for regression vs Vina. >> >> To my understanding you trained the RF using class information (active, > inactive) and the prediction was a probability value. If the probability > is above 0.5 then the compound is an active, otherwise it is an inactive. > This is how sklearn.ensemble.RandomForestClassifier works. > We trained a RandomForestRegressor with the binding affinities of DUD-E actives. The decoys were arbitrarily assigned a pK activity of 5.95. > > In contrast I train MLPRegressors using binding affinities (scalar values) > and the predictions are binding affinities (scalar values). > We will have a chance to talk it through in Berlin, see you there!
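The shift/scale fit via scipy.optimize that Thomas mentions could look roughly like the sketch below: treat assay A as the reference and find the shift Sh and scale Sc for assay B that minimize the RMSD over the ligands both assays measured. The affinity values and the choice of Nelder-Mead are illustrative assumptions, not details from the actual analysis:

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical affinities (kcal/mol) of the ligands shared by assays A and B
affinities_A = np.array([-7.0, -9.0, -6.5, -8.2])
affinities_B = np.array([-10.0, -12.0, -9.5, -11.2])

def rmsd(params):
    """RMSD between assay A and the shifted/scaled assay B values."""
    shift, scale = params
    return np.sqrt(np.mean((scale * affinities_B + shift - affinities_A) ** 2))

# Start from the identity transform: no shift, unit scale
result = minimize(rmsd, x0=[0.0, 1.0], method="Nelder-Mead")
shift, scale = result.x
# In this toy data, assay B is just assay A offset by -3 kcal/mol, so the
# optimizer recovers shift ~ 3 and scale ~ 1 with RMSD ~ 0
```

With more than two assays, the parameter vector simply grows to (Sh_B, Sc_B, Sh_C, Sc_C, ...) with assay A held fixed as the reference, and the objective sums the squared deviations over every overlapping ligand pair.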
> > > > >> ---- >> Pozdrawiam, | Best regards, >> Maciek W?jcikowski >> maciek at wojcikowski.pl >> >> 2017-09-06 20:48 GMT+02:00 Thomas Evangelidis : >> >>> ?? >>> After some though about this problem today, I think it is an objective >>> function minimization problem, when the objective function can be the root >>> mean square deviation (RMSD) between the affinities of the common molecules >>> in the two data sets. I could work iteratively, first rescale and fit assay >>> B to match A, then proceed to assay C and so forth. Or alternatively, for >>> each Assay I need to find two missing variables, the optimum shift Sh and >>> the scale Sc. So if I have 3 Assays A, B, C lets say, I am looking for the >>> optimum values of Sh_A, Sc_A, Sh_B, Sc_B, Sh_C, Sc_C that minimize the RMSD >>> between the binding affinities of the overlapping molecules. Any idea how I >>> can do that with scikit-learn? >>> >>> >>> On 6 September 2017 at 00:29, Thomas Evangelidis >>> wrote: >>> >>>> Thanks Jason, Sebastian and Maciek! >>>> >>>> I believe from all the suggestions, the most feasible solutions is to >>>> look experimental assays which overlap by at least two compounds, and then >>>> adjust the binding affinities of one of them by looking in their difference >>>> in both assays. Sebastian mentioned the simplest scenario, where the shift >>>> for both compounds is 2 kcal/mol. However, he neglected to mention that the >>>> ratio between the affinities of the two compounds in each assay also >>>> matters. Specifically, the ratio Ka/Kb=-7/-9=0.78 in assay A but >>>> -10/-12=0.83 in assay B. Ideally that should also be taken into account to >>>> select the right transformation function for the values from Assay B. Is >>>> anybody away of any clever algorithm to select the right transformation >>>> function for such a case? I am sure there exists. >>>> >>>> The other approach would be to train different predictors from each >>>> assay and then apply a data fusion technique (e.g. 
min rank). But that >>>> wouldn't be that elegant. >>>> >>>> @Maciek To my understanding, the paper you cited addresses a >>>> classification problem (actives, inactives) by implementing Random Forrest >>>> Classfiers. My case is a Regression problem. >>>> >>>> >>>> best, >>>> Thomas >>>> >>>> >>>> On 5 September 2017 at 20:33, Maciek W?jcikowski >>> > wrote: >>>> >>>>> Hi Thomas and others, >>>>> >>>>> It also really depend on how many data points you have on each >>>>> compound. If you had more than a few then there are few options. If you get >>>>> two very distinct activities for one ligand. I'd discard such samples as >>>>> ambiguous or decide on one of the assays/experiments (the one with lower >>>>> error). The exact problem was faced by PDBbind creators, I'd also look >>>>> there for details what they did with their activities. >>>>> >>>>> To follow up Sebastians suggestion: have you checked how different >>>>> ranks/Z-scores you get? Check out the Kendall Tau. >>>>> >>>>> Anyhow, you could build local models for a specific experimental >>>>> methods. In our recent publication on slightly different area >>>>> (protein-ligand scoring function), we show that the RF build on one target >>>>> is just slightly better than the RF build on many targets (we've used DUD-E >>>>> database); Checkout the "horizontal" and "per-target" splits >>>>> https://www.nature.com/articles/srep46710. Unfortunately, this may >>>>> change for different models. Plus the molecular descriptors used, which we >>>>> know nothing about. >>>>> >>>>> I hope that helped a bit. >>>>> >>>>> ---- >>>>> Pozdrawiam, | Best regards, >>>>> Maciek W?jcikowski >>>>> maciek at wojcikowski.pl >>>>> >>>>> 2017-09-05 19:35 GMT+02:00 Sebastian Raschka : >>>>> >>>>>> Another approach would be to pose this as a "ranking" problem to >>>>>> predict relative affinities rather than absolute affinities. 
E.g., if you >>>>>> have data from one (or more) molecules that has/have been tested under 2 or >>>>>> more experimental conditions, you can rank the other molecules accordingly >>>>>> or normalize. E.g. if you observe that the binding affinity of molecule a >>>>>> is -7 kcal/mol in assay A and -9 kcal/mol in assay to, and say the binding >>>>>> affinities of molecule B are -10 and -12 kcal/mol, respectively, that >>>>>> should give you some information for normalizing the values from assay 2 >>>>>> (e.g., by adding 2 kcal/mol). Of course this is not a perfect solution and >>>>>> might be error prone, but so are experimental assays ... (when I sometimes >>>>>> look at the std error/CI of the data I get from collaborators ... well, it >>>>>> seems that absolute binding affinities have always taken with a grain of >>>>>> salt anyway) >>>>>> >>>>>> Best, >>>>>> Sebastian >>>>>> >>>>>> > On Sep 5, 2017, at 1:02 PM, Jason Rudy wrote: >>>>>> > >>>>>> > Thomas, >>>>>> > >>>>>> > This is sort of related to the problem I did my M.S. thesis on >>>>>> years ago: cross-platform normalization of gene expression data. If you >>>>>> google that term you'll find some papers. The situation is somewhat >>>>>> different, though, because with microarrays or RNA-seq you get thousands of >>>>>> data points for each experiment, which makes it easier to estimate the >>>>>> batch effect. The principle is the similar, however. >>>>>> > >>>>>> > If I were in your situation, I would consider whether I have any of >>>>>> the following advantages: >>>>>> > >>>>>> > 1. Some molecules that appear in multiple data sets >>>>>> > 2. Detailed information about the different experimental conditions >>>>>> > 3. Physical/chemical models of how experimental conditions >>>>>> influence binding affinity >>>>>> > >>>>>> > If you have any of the above, you can potentially use them to >>>>>> improve your estimates. 
You could also consider using experiment ID as a >>>>>> categorical predictor in a sufficiently general regression method. >>>>>> > >>>>>> > Lastly, you may already know this, but the term "meta-analysis" is >>>>>> relevant here, and you can google for specific techniques. Most of these >>>>>> would be more limited than what you are envisioning, I think. >>>>>> > >>>>>> > Best, >>>>>> > >>>>>> > Jason >>>>>> > >>>>>> > On Tue, Sep 5, 2017 at 6:39 AM, Thomas Evangelidis < >>>>>> tevang3 at gmail.com> wrote: >>>>>> > Greetings, >>>>>> > >>>>>> > I am working on a problem that involves predicting the binding >>>>>> affinity of small molecules on a receptor structure (is regression problem, >>>>>> not classification). I have multiple small datasets of molecules with >>>>>> measured binding affinities on a receptor, but each dataset was measured in >>>>>> different experimental conditions and therefore I cannot use them all >>>>>> together as trainning set. So, instead of using them individually, I was >>>>>> wondering whether there is a method to combine them all into a super >>>>>> training set. The first way I could think of is to convert the binding >>>>>> affinities to Z-scores and then combine all the small datasets of >>>>>> molecules. But this is would be inaccurate because, firstly the datasets >>>>>> are very small (10-50 molecules each), and secondly, the range of binding >>>>>> affinities differs in each experiment (some datasets contain really strong >>>>>> binders, while others do not, etc.). Is there any other approach to combine >>>>>> datasets with values coming from different sources? Maybe if som >>>>>> eone points me to the right reference I could read and understand if >>>>>> it is applicable to my case. 
>>>>>> > >>>>>> > Thanks, >>>>>> > Thomas >>>>>> > >>>>>> > -- >>>>>> > ============================================================ >>>>>> ========== >>>>>> > Dr Thomas Evangelidis >>>>>> > Post-doctoral Researcher >>>>>> > CEITEC - Central European Institute of Technology >>>>>> > Masaryk University >>>>>> > Kamenice 5/A35/2S049, >>>>>> > 62500 Brno, Czech Republic >>>>>> > >>>>>> > email: tevang at pharm.uoa.gr >>>>>> > tevang3 at gmail.com >>>>>> > >>>>>> > website: https://sites.google.com/site/thomasevangelidishomepage/ >>>>>> > >>>>>> > >>>>>> > >>>>>> > _______________________________________________ >>>>>> > scikit-learn mailing list >>>>>> > scikit-learn at python.org >>>>>> > https://mail.python.org/mailman/listinfo/scikit-learn >>>>>> > >>>>>> > >>>>>> > _______________________________________________ >>>>>> > scikit-learn mailing list >>>>>> > scikit-learn at python.org >>>>>> > https://mail.python.org/mailman/listinfo/scikit-learn >>>>>> >>>>>> _______________________________________________ >>>>>> scikit-learn mailing list >>>>>> scikit-learn at python.org >>>>>> https://mail.python.org/mailman/listinfo/scikit-learn >>>>>> >>>>> >>>>> >>>>> _______________________________________________ >>>>> scikit-learn mailing list >>>>> scikit-learn at python.org >>>>> https://mail.python.org/mailman/listinfo/scikit-learn >>>>> >>>>> >>>> >>>> >>>> -- >>>> >>>> ====================================================================== >>>> >>>> Dr Thomas Evangelidis >>>> >>>> Post-doctoral Researcher >>>> CEITEC - Central European Institute of Technology >>>> Masaryk University >>>> Kamenice 5/A35/2S049, >>>> 62500 Brno, Czech Republic >>>> >>>> email: tevang at pharm.uoa.gr >>>> >>>> tevang3 at gmail.com >>>> >>>> >>>> website: https://sites.google.com/site/thomasevangelidishomepage/ >>>> >>>> >>> >>> >>> -- >>> >>> ====================================================================== >>> >>> Dr Thomas Evangelidis >>> >>> Post-doctoral Researcher >>> CEITEC - 
Central European Institute of Technology >>> Masaryk University >>> Kamenice 5/A35/2S049, >>> 62500 Brno, Czech Republic >>> >>> email: tevang at pharm.uoa.gr >>> >>> tevang3 at gmail.com >>> >>> >>> website: https://sites.google.com/site/thomasevangelidishomepage/ >>> >>> >>> _______________________________________________ >>> scikit-learn mailing list >>> scikit-learn at python.org >>> https://mail.python.org/mailman/listinfo/scikit-learn >>> >>> >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> > > > -- > > ====================================================================== > > Dr Thomas Evangelidis > > Post-doctoral Researcher > CEITEC - Central European Institute of Technology > Masaryk University > Kamenice 5/A35/2S049, > 62500 Brno, Czech Republic > > email: tevang at pharm.uoa.gr > > tevang3 at gmail.com > > > website: https://sites.google.com/site/thomasevangelidishomepage/ > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From tesleft at hotmail.com Sun Sep 10 00:44:56 2017 From: tesleft at hotmail.com (Martin Lee) Date: Sun, 10 Sep 2017 04:44:56 +0000 Subject: [scikit-learn] how to make result less number of group with NearestNeighbors? Message-ID: nbrs = NearestNeighbors(n_neighbors=10,radius=100.0,metric='euclidean',algorithm='ball_tree').fit(testing1) distances, indices = nbrs.kneighbors(testing1) just expect when each point distance less than 100 then group into one group Martin -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From tevang3 at gmail.com Sun Sep 10 15:13:09 2017 From: tevang3 at gmail.com (Thomas Evangelidis) Date: Sun, 10 Sep 2017 21:13:09 +0200 Subject: [scikit-learn] control value range of MLPRegressor predictions Message-ID: Greetings, Is there any way to force the MLPRegressor to make predictions in the same value range as the training data? For example, if the training data range between -5 and -9, I don't want the predictions to range between -820 and -800. In fact, sometimes I get anti-correlated predictions, for example between 800 and 820, and I have to change the sign in order to calculate correlations with experimental values. Is there a way to control the value range explicitly or implicitly (by post-processing the predictions)? thanks Thomas -- ====================================================================== Dr Thomas Evangelidis Post-doctoral Researcher CEITEC - Central European Institute of Technology Masaryk University Kamenice 5/A35/2S049, 62500 Brno, Czech Republic email: tevang at pharm.uoa.gr tevang3 at gmail.com website: https://sites.google.com/site/thomasevangelidishomepage/ -------------- next part -------------- An HTML attachment was scrubbed... URL: From se.raschka at gmail.com Sun Sep 10 16:03:35 2017 From: se.raschka at gmail.com (Sebastian Raschka) Date: Sun, 10 Sep 2017 16:03:35 -0400 Subject: [scikit-learn] control value range of MLPRegressor predictions In-Reply-To: References: Message-ID: <72A94197-A1AF-4D57-BD6B-68941043AF5B@gmail.com> You could normalize the outputs (e.g., via min-max scaling). However, I think the more intuitive way would be to clip the predictions. E.g., say you are predicting house prices; it probably makes no sense to have a negative prediction, so you would clip the output at some value >$0. PS: -820 and -800 sounds a bit extreme if your training data is in a -5 to -9 range. Is your training data from a different population than the one you use for testing/making predictions?
Or maybe it's just an extreme case of overfitting. Best, Sebastian > On Sep 10, 2017, at 3:13 PM, Thomas Evangelidis wrote: > > Greetings, > > Is there any way to force the MLPRegressor to make predictions in the same value range as the training data? For example, if the training data range between -5 and -9, I don't want the predictions to range between -820 and -800. In fact, some times I get anti-correlated predictions, for example between 800 and 820 and I have to change the sign in order to calculate correlations with experimental values. Is there a way to control the value range explicitly or implicitly (by post-processing the predictions)? > > thanks > Thomas > > > -- > ====================================================================== > Dr Thomas Evangelidis > Post-doctoral Researcher > CEITEC - Central European Institute of Technology > Masaryk University > Kamenice 5/A35/2S049, > 62500 Brno, Czech Republic > > email: tevang at pharm.uoa.gr > tevang3 at gmail.com > > website: https://sites.google.com/site/thomasevangelidishomepage/ > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn From tevang3 at gmail.com Sun Sep 10 16:43:30 2017 From: tevang3 at gmail.com (Thomas Evangelidis) Date: Sun, 10 Sep 2017 22:43:30 +0200 Subject: [scikit-learn] control value range of MLPRegressor predictions In-Reply-To: <72A94197-A1AF-4D57-BD6B-68941043AF5B@gmail.com> References: <72A94197-A1AF-4D57-BD6B-68941043AF5B@gmail.com> Message-ID: On 10 September 2017 at 22:03, Sebastian Raschka wrote: > You could normalize the outputs (e.g., via min-max scaling). However, I > think the more intuitive way would be to clip the predictions. 
E.g., say > you are predicting house prices; it probably makes no sense to have a > negative prediction, so you would clip the output at some value >$0 > > By clipping do you mean discarding the predictions that fall below/above the threshold? > PS: -820 and -800 sounds a bit extreme if your training data is in a -5 to > -9 range. Is your training data from a different population than the one > you use for testing/making predictions? Or maybe it's just an extreme case > of overfitting. > > It is from the same population, but the training sets I use are very small (6-32 observations), so it must be over-fitting. We had that discussion in the past here, yet in practice I get good correlations with the experimental values using MLPRegressors. > Best, > Sebastian > > > On Sep 10, 2017, at 3:13 PM, Thomas Evangelidis > wrote: > > > > Greetings, > > > > Is there any way to force the MLPRegressor to make predictions in the > same value range as the training data? For example, if the training data > range between -5 and -9, I don't want the predictions to range between -820 > and -800. In fact, sometimes I get anti-correlated predictions, for > example between 800 and 820 and I have to change the sign in order to > calculate correlations with experimental values. Is there a way to control > the value range explicitly or implicitly (by post-processing the > predictions)?
> > > > thanks > > Thomas > > > > > > -- > > ====================================================================== > > Dr Thomas Evangelidis > > Post-doctoral Researcher > > CEITEC - Central European Institute of Technology > > Masaryk University > > Kamenice 5/A35/2S049, > > 62500 Brno, Czech Republic > > > > email: tevang at pharm.uoa.gr > > tevang3 at gmail.com > > > > website: https://sites.google.com/site/thomasevangelidishomepage/ > > > > > > _______________________________________________ > > scikit-learn mailing list > > scikit-learn at python.org > > https://mail.python.org/mailman/listinfo/scikit-learn > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -- ====================================================================== Dr Thomas Evangelidis Post-doctoral Researcher CEITEC - Central European Institute of Technology Masaryk University Kamenice 5/A35/2S049, 62500 Brno, Czech Republic email: tevang at pharm.uoa.gr tevang3 at gmail.com website: https://sites.google.com/site/thomasevangelidishomepage/ -------------- next part -------------- An HTML attachment was scrubbed... URL: From se.raschka at gmail.com Sun Sep 10 17:08:58 2017 From: se.raschka at gmail.com (Sebastian Raschka) Date: Sun, 10 Sep 2017 17:08:58 -0400 Subject: [scikit-learn] control value range of MLPRegressor predictions In-Reply-To: References: <72A94197-A1AF-4D57-BD6B-68941043AF5B@gmail.com> Message-ID: <2B92AEBE-CD27-4723-B829-2A201946D0C0@gmail.com> With clipping, I mean thresholding the output, e.g., via something like min/max(some_constant, actual_output) or, as in a leaky ReLU, min/max(some_constant * 0.001, actual_output). Alternatively, you could use a sigmoidal function (something like tanh but with a larger co-domain) as the output unit, but I am not sure the MLPRegressor allows that.
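For scikit-learn's MLPRegressor, the simplest version of this clipping is plain post-processing with NumPy, thresholding predictions at the range seen in the training targets. The arrays below are made-up values for illustration:

```python
import numpy as np

y_train = np.array([-5.2, -6.8, -9.0, -7.4])   # training-set affinities
y_pred = np.array([-4.9, -8.1, -820.0, -3.3])  # raw model predictions

# Threshold every prediction at the min/max seen in the training targets
y_clipped = np.clip(y_pred, y_train.min(), y_train.max())
# The -820.0 outlier becomes -9.0; values above -5.2 are pulled down to -5.2
```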
In that case, you probably want to implement the MLP regressor yourself (e.g., via TensorFlow or PyTorch) to have some room for experimentation with your output units. Best, Sebastian > On Sep 10, 2017, at 4:43 PM, Thomas Evangelidis wrote: > > > > On 10 September 2017 at 22:03, Sebastian Raschka wrote: > You could normalize the outputs (e.g., via min-max scaling). However, I think the more intuitive way would be to clip the predictions. E.g., say you are predicting house prices, it probably makes no sense to have a negative prediction, so you would clip the output at some value >0$ > > > ?By clipping you mean discarding the predictors that give values below/above the threshold? > > > PS: -820 and -800 sounds a bit extreme if your training data is in a -5 to -9 range. Is your training data from a different population then the one you use for testing/making predictions? Or maybe it's just an extreme case of overfitting. > > > ?It is from the same population, but the training sets I use are very small (6-32 observations), so it must be over-fitting. We had that discussion in the past here, yet in practice I get good correlations with the experimental values using MLPRegressors.? > > > Best, > Sebastian > > > > On Sep 10, 2017, at 3:13 PM, Thomas Evangelidis wrote: > > > > Greetings, > > > > Is there any way to force the MLPRegressor to make predictions in the same value range as the training data? For example, if the training data range between -5 and -9, I don't want the predictions to range between -820 and -800. In fact, some times I get anti-correlated predictions, for example between 800 and 820 and I have to change the sign in order to calculate correlations with experimental values. Is there a way to control the value range explicitly or implicitly (by post-processing the predictions)? 
> > > > thanks > > Thomas > > > > > > -- > > ====================================================================== > > Dr Thomas Evangelidis > > Post-doctoral Researcher > > CEITEC - Central European Institute of Technology > > Masaryk University > > Kamenice 5/A35/2S049, > > 62500 Brno, Czech Republic > > > > email: tevang at pharm.uoa.gr > > tevang3 at gmail.com > > > > website: https://sites.google.com/site/thomasevangelidishomepage/ > > > > > > _______________________________________________ > > scikit-learn mailing list > > scikit-learn at python.org > > https://mail.python.org/mailman/listinfo/scikit-learn > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > > > -- > ====================================================================== > Dr Thomas Evangelidis > Post-doctoral Researcher > CEITEC - Central European Institute of Technology > Masaryk University > Kamenice 5/A35/2S049, > 62500 Brno, Czech Republic > > email: tevang at pharm.uoa.gr > tevang3 at gmail.com > > website: https://sites.google.com/site/thomasevangelidishomepage/ > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn From shane.grigsby at colorado.edu Sun Sep 10 20:45:12 2017 From: shane.grigsby at colorado.edu (Shane Grigsby) Date: Sun, 10 Sep 2017 18:45:12 -0600 Subject: [scikit-learn] how to make result less number of group with NearestNeighbors? 
In-Reply-To: References: Message-ID: <20170911004512.geso3opovitcqpy7@espgs-MacBook-Pro.local> I think you want to call the radius_neighbors method (check here: http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.NearestNeighbors.html#sklearn.neighbors.NearestNeighbors.radius_neighbors) (you're using kneighbors, replace with radius_neighbors) ~Shane On 09/10, Martin Lee wrote: > nbrs = NearestNeighbors(n_neighbors=10,radius=100.0,metric='euclidean',algorithm='ball_tree').fit(testing1) > distances, indices = nbrs.kneighbors(testing1) > >just expect when each point distance less than 100 then group into one group > > >Martin >_______________________________________________ >scikit-learn mailing list >scikit-learn at python.org >https://mail.python.org/mailman/listinfo/scikit-learn -- *PhD candidate & Research Assistant* *Cooperative Institute for Research in Environmental Sciences (CIRES)* *University of Colorado at Boulder* From joel.nothman at gmail.com Sun Sep 10 21:41:35 2017 From: joel.nothman at gmail.com (Joel Nothman) Date: Mon, 11 Sep 2017 11:41:35 +1000 Subject: [scikit-learn] how to make result less number of group with NearestNeighbors? In-Reply-To: References: Message-ID: Given your related post on the issue tracker, I think you're trying to perform clustering. Use DBSCAN, which is a standard approach to clustering based on neighborhoods within radius. On 10 September 2017 at 14:44, Martin Lee wrote: > nbrs = NearestNeighbors(n_neighbors=10,radius=100.0,metric='euclide > an',algorithm='ball_tree').fit(testing1) > distances, indices = nbrs.kneighbors(testing1) > > just expect when each point distance less than 100 then group into one > group > > > Martin > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... 
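A minimal sketch of the DBSCAN suggestion for Martin's question — `eps` plays the role of the 100-unit radius, and points reachable through neighbors closer than `eps` end up in the same group (the coordinates below are made up):

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two well-separated clumps of points on a line, as 2-D coordinates
points = np.array([[0, 0], [10, 0], [20, 0],
                   [500, 0], [510, 0], [520, 0]])

# min_samples=1 so that no point is labeled as noise (-1)
labels = DBSCAN(eps=100.0, min_samples=1).fit_predict(points)
# The first three points share one label, the last three another
```

Note that DBSCAN groups by chains of neighbors, so two points farther than `eps` apart can still land in one group if intermediate points connect them.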
URL: From tevang3 at gmail.com Mon Sep 11 18:13:08 2017 From: tevang3 at gmail.com (Thomas Evangelidis) Date: Tue, 12 Sep 2017 00:13:08 +0200 Subject: [scikit-learn] custom loss function Message-ID: Greetings, I know this is a recurrent question, but I would like to use my own loss function either in an MLPRegressor or in an SVR. For the MLPRegressor case so far my conclusion was that it is not possible unless you modify the source code. On the other hand, for the SVR I was looking at setting custom kernel functions. But I am not sure if this is the same thing. Could someone please clarify this to me? Finally, I read about the "scoring" parameter in cross-validation, but this is just to select a Regressor that has been trained already with the default loss function, so it would be harder to find one that minimizes my own loss function. For the record, my loss function is the centered root mean square error. Thanks in advance for any advice. -- ====================================================================== Dr Thomas Evangelidis Post-doctoral Researcher CEITEC - Central European Institute of Technology Masaryk University Kamenice 5/A35/2S049, 62500 Brno, Czech Republic email: tevang at pharm.uoa.gr tevang3 at gmail.com website: https://sites.google.com/site/thomasevangelidishomepage/ -------------- next part -------------- An HTML attachment was scrubbed... URL: From se.raschka at gmail.com Mon Sep 11 18:37:09 2017 From: se.raschka at gmail.com (Sebastian Raschka) Date: Mon, 11 Sep 2017 18:37:09 -0400 Subject: [scikit-learn] custom loss function In-Reply-To: References: Message-ID: Hi Thomas, > For the MLPRegressor case so far my conclusion was that it is not possible unless you modify the source code. Also, I suspect that this would be non-trivial.
I haven't looked too closely at how the MLPClassifier/MLPRegressor are implemented but since you perform the weight updates based on the gradient of the cost function wrt the weights, the modification would be non-trivial if the partial derivatives are not computed based on some autodiff implementation -- you would have to edit all the partial d's along the backpropagation up to the first hidden layer. While I think that scikit-learn is by far the best library out there for machine learning, I think if you want an easy solution, you probably won't get around TensorFlow or PyTorch or equivalent, here, for your specific MLP problem unless you want to make your life extra hard :P (seriously, you can pick up either of the two in about an hour and have your MLPRegressor up and running so that you can then experiment with your cost function). Best, Sebastian > On Sep 11, 2017, at 6:13 PM, Thomas Evangelidis wrote: > > Greetings, > > I know this is a recurrent question, but I would like to use my own loss function either in a MLPRegressor or in an SVR. For the MLPRegressor case so far my conclusion was that it is not possible unless you modify the source code. On the other hand, for the SVR I was looking at setting custom kernel functions. But I am not sure if this is the same thing. Could someone please clarify this to me? Finally, I read about the "scoring" parameter is cross-validation, but this is just to select a Regressor that has been trained already with the default loss function, so it would be harder to find one that minimizes my own loss function. > > For the record, my loss function is the centered root mean square error. > > Thanks in advance for any advice.
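[Editor's note] Even without touching the training loss, a custom metric can at least drive model *selection* through the scoring mechanism Thomas mentions. A sketch, assuming "centered RMSE" means the RMSE after removing each vector's mean (the function and scorer names are invented here):

```python
import numpy as np
from sklearn.metrics import make_scorer

def centered_rmse(y_true, y_pred):
    # Subtract each vector's mean so a constant offset costs nothing
    # (assumed reading of "centered root mean square error").
    dt = y_true - np.mean(y_true)
    dp = y_pred - np.mean(y_pred)
    return np.sqrt(np.mean((dp - dt) ** 2))

# greater_is_better=False makes scikit-learn negate the value, since
# model selection always maximises the score.
crmse_scorer = make_scorer(centered_rmse, greater_is_better=False)

y_true = np.array([1.0, 2.0, 3.0])
print(centered_rmse(y_true, y_true + 5.0))  # pure offset -> 0.0
```

crmse_scorer can then be passed as scoring=... to cross_val_score or GridSearchCV; as Thomas notes, this only selects among models trained on the default loss, it does not change the loss being optimised.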
> > > > -- > > ====================================================================== > > Dr Thomas Evangelidis > > Post-doctoral Researcher > > CEITEC - Central European Institute of Technology > > Masaryk University > > Kamenice 5/A35/2S049, > > 62500 Brno, Czech Republic > > > > email: tevang at pharm.uoa.gr > > tevang3 at gmail.com > > > > website: https://sites.google.com/site/thomasevangelidishomepage/ > > > > > > _______________________________________________ > > scikit-learn mailing list > > scikit-learn at python.org > > https://mail.python.org/mailman/listinfo/scikit-learn > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn From ryanmackenzieconway at gmail.com Tue Sep 12 16:01:33 2017 From: ryanmackenzieconway at gmail.com (Ryan Conway) Date: Tue, 12 Sep 2017 13:01:33 -0700 Subject: [scikit-learn] Terminating a Pipeline with a NearestNeighbors search Message-ID: Hello, I'm wondering if sklearn provides a means of terminating pipelines with a NearestNeighbors search. For example, my workflow is DictVectorizer -> TfidfTransformer -> NearestNeighbors. I'd like to capture this in an sklearn Pipeline. Unfortunately, Pipeline does not expose a kneighbors() method that would run all intermediate transforms and then return the result of NearestNeighbors.kneighbors(). I went through Pipeline's source and noticed its decision_function(), predict() etc. all capture this functionality with different terminating operation names. Maybe there is some way to specify the terminating operation method name rather than relying on these Pipeline methods? Thank you, Ryan -------------- next part -------------- An HTML attachment was scrubbed... URL: From s.atasever at gmail.com Wed Sep 13 03:37:02 2017 From: s.atasever at gmail.com (Sema Atasever) Date: Wed, 13 Sep 2017 10:37:02 +0300 Subject: [scikit-learn] Accessing Clustering Feature Tree in Birch In-Reply-To: References: Message-ID: Dear Roman, I tried searching the web, but I didn't find any information or examples. Could you give me an example of using _CFNode.centroids_?
I would appreciate it if you would help me. On Wed, Aug 23, 2017 at 2:28 PM, Roman Yurchak wrote: > > what are the data samples in this cluster > > Mehmet's response below works for exploring the hierarchical tree. > However, Birch currently doesn't store the data samples that belong to a > given subcluster. If you need that, as far as I know, a reasonable > approximation can be obtained by computing the data samples that are > closest to the centroid of the considered subcluster (accessible via > _CFNode.centroids_) as compared to all other subcluster centroids at this > hierarchical tree depth. > > Alternatively, the modifications in PR https://github.com/scikit-lear > n/scikit-learn/pull/8808 aimed to make this process easier.. > -- > Roman > > > On 23/08/17 13:44, Suzen, Mehmet wrote: > >> Hi Sema, >> >> You can access CFNode from the fit output, assign fit output, so you >> can have the object. >> >> brc_fit = brc.fit(X) >> brc_fit_cfnode = brc_fit.root_ >> >> >> Then you can access CFNode, see here >> https://kite.com/docs/python/sklearn.cluster.birch._CFNode >> >> Also, this example comparing mini batch kmeans. >> http://scikit-learn.org/stable/auto_examples/cluster/plot_ >> birch_vs_minibatchkmeans.html >> >> Hope this was what you are after. >> >> Best, >> Mehmet >> >> On 23 August 2017 at 10:55, Sema Atasever wrote: >> >>> Dear scikit-learn members, >>> >>> Considering the "CF-tree" data structure : >>> >>> - How can i access Clustering Feature Tree in Birch? >>> >>> - For example, how many clusters are there in the hierarchy under the >>> root >>> node and what are the data samples in this cluster? >>> >>> - Can I get them separately for 3 trees? >>> >>> Best. 
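[Editor's note] A rough illustration of the approximation Roman describes. Note that root_, centroids_, and the _CFNode machinery are private scikit-learn internals, so this may break between versions; the data and threshold are arbitrary:

```python
import numpy as np
from sklearn.cluster import Birch

rng = np.random.RandomState(42)
X = np.vstack([rng.normal(0, 0.5, size=(30, 2)),
               rng.normal(10, 0.5, size=(30, 2))])

brc = Birch(threshold=1.0, n_clusters=None).fit(X)

# root_ is a private _CFNode; its centroids_ array has one row per
# subcluster directly under the root of the CF-tree.
root = brc.root_
print(root.centroids_.shape)

# Roman's approximation: assign each sample to the root subcluster
# with the nearest centroid.
d = np.linalg.norm(X[:, None, :] - root.centroids_[None, :, :], axis=2)
assignment = d.argmin(axis=1)
for k in range(root.centroids_.shape[0]):
    print('subcluster', k, 'gets', int(np.sum(assignment == k)), 'samples')
```

The same nearest-centroid assignment can be repeated one level down by following each subcluster's child_ node, which is how you would walk deeper into the tree.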
>>> >>> _______________________________________________ >>> scikit-learn mailing list >>> scikit-learn at python.org >>> https://mail.python.org/mailman/listinfo/scikit-learn >>> >>> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From tevang3 at gmail.com Wed Sep 13 05:25:29 2017 From: tevang3 at gmail.com (Thomas Evangelidis) Date: Wed, 13 Sep 2017 11:25:29 +0200 Subject: [scikit-learn] custom loss function In-Reply-To: References: Message-ID: Thanks Sebastian. Exploring Tensorflow capabilities was in my TODO list, but now it's in my immediate plans. What about the SVR? Is it possible to change the loss function there? Could you please clarify what the "x" and "x'" parameters in the default Kernel functions mean? Is "x" a NxM array, where N is the number of observations and M the number of features? http://scikit-learn.org/stable/modules/svm.html#kernel-functions On 12 September 2017 at 00:37, Sebastian Raschka wrote: > Hi Thomas, > > > For the MLPRegressor case so far my conclusion was that it is not > possible unless you modify the source code. > > Also, I suspect that this would be non-trivial. I haven't looked to > closely at how the MLPClassifier/MLPRegressor are implemented but since you > perform the weight updates based on the gradient of the cost function wrt > the weights, the modification would be non-trivial if the partial > derivatives are not computed based on some autodiff implementation -- you > would have to edit all the partial d's along the backpropagation up to the > first hidden layer. 
While I think that scikit-learn is by far the best > library out there for machine learning, I think if you want an easy > solution, you probably won't get around TensorFlow or PyTorch or > equivalent, here, for your specific MLP problem unless you want to make > your life extra hard :P (seriously, you can pick up any of the two in about > an hour and have your MLPRegressor up and running so that you can then > experiment with your cost function). > > Best, > Sebastian > > > On Sep 11, 2017, at 6:13 PM, Thomas Evangelidis > wrote: > > > > Greetings, > > > > I know this is a recurrent question, but I would like to use my own loss > function either in a MLPRegressor or in an SVR. For the MLPRegressor case > so far my conclusion was that it is not possible unless you modify the > source code. On the other hand, for the SVR I was looking at setting custom > kernel functions. But I am not sure if this is the same thing. Could > someone please clarify this to me? Finally, I read about the "scoring" > parameter is cross-validation, but this is just to select a Regressor that > has been trained already with the default loss function, so it would be > harder to find one that minimizes my own loss function. > > > > For the record, my loss function is the centered root mean square error. > > > > Thanks in advance for any advice. 
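[Editor's note] To see concretely why the change propagates through the whole backward pass, here is a toy one-hidden-layer regressor in plain NumPy (layer sizes, data, and the learning rate are all arbitrary). The loss enters backpropagation only through loss_grad, so swapping the loss means swapping exactly that function -- but its gradient then flows through every layer below it:

```python
import numpy as np

rng = np.random.RandomState(0)
X = rng.randn(200, 3)
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.randn(200)

# One hidden layer of 16 tanh units, small random initial weights.
W1 = 0.1 * rng.randn(3, 16); b1 = np.zeros(16)
W2 = 0.1 * rng.randn(16, 1); b2 = np.zeros(1)

def loss_grad(pred, target):
    # d(loss)/d(pred) for squared error; a custom loss would replace
    # exactly this function.
    return 2.0 * (pred - target) / len(target)

losses = []
for _ in range(500):
    h = np.tanh(X @ W1 + b1)                  # forward pass
    pred = (h @ W2 + b2).ravel()
    losses.append(np.mean((pred - y) ** 2))
    err = loss_grad(pred, y)[:, None]         # (N, 1) error signal
    gW2 = h.T @ err; gb2 = err.sum(axis=0)
    dh = (err @ W2.T) * (1 - h ** 2)          # backprop through tanh
    gW1 = X.T @ dh; gb1 = dh.sum(axis=0)
    for p, g in ((W1, gW1), (b1, gb1), (W2, gW2), (b2, gb2)):
        p -= 0.1 * g                          # gradient-descent step
print(losses[0] > losses[-1])  # -> True
```

Every gradient from gW2 down to gW1 depends on err, which is why a hand-derived backpropagation (as opposed to autodiff) has to be re-derived when the loss changes.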
> > > > > > > > -- > > ====================================================================== > > Dr Thomas Evangelidis > > Post-doctoral Researcher > > CEITEC - Central European Institute of Technology > > Masaryk University > > Kamenice 5/A35/2S049, > > 62500 Brno, Czech Republic > > > > email: tevang at pharm.uoa.gr > > tevang3 at gmail.com > > > > website: https://sites.google.com/site/thomasevangelidishomepage/ > > > > > > _______________________________________________ > > scikit-learn mailing list > > scikit-learn at python.org > > https://mail.python.org/mailman/listinfo/scikit-learn > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -- ====================================================================== Dr Thomas Evangelidis Post-doctoral Researcher CEITEC - Central European Institute of Technology Masaryk University Kamenice 5/A35/2S049, 62500 Brno, Czech Republic email: tevang at pharm.uoa.gr tevang3 at gmail.com website: https://sites.google.com/site/thomasevangelidishomepage/ -------------- next part -------------- An HTML attachment was scrubbed... URL: From se.raschka at gmail.com Wed Sep 13 12:14:25 2017 From: se.raschka at gmail.com (Sebastian Raschka) Date: Wed, 13 Sep 2017 12:14:25 -0400 Subject: [scikit-learn] custom loss function In-Reply-To: References: Message-ID: > What about the SVR? Is it possible to change the loss function there? Here you would have the same problem; SVR is a constrained optimization problem and you would have to change the calculation of the loss gradient then. Since SVR is a "1-layer" neural net, if you change the cost function to something else, it's not really a SVR anymore. > Could you please clarify what the "x" and "x'" parameters in the default Kernel functions mean? Is "x" a NxM array, where N is the number of observations and M the number of features? 
Both x and x' should be denoting training examples. The kernel matrix is symmetric (N x N). Best, Sebastian > On Sep 13, 2017, at 5:25 AM, Thomas Evangelidis wrote: > > Thanks Sebastian. Exploring Tensorflow capabilities was in my TODO list, but now it's in my immediate plans. > What about the SVR? Is it possible to change the loss function there? Could you please clarify what the "x" and "x'" parameters in the default Kernel functions mean? Is "x" a NxM array, where N is the number of observations and M the number of features? > > http://scikit-learn.org/stable/modules/svm.html#kernel-functions > > > > On 12 September 2017 at 00:37, Sebastian Raschka wrote: > Hi Thomas, > > > For the MLPRegressor case so far my conclusion was that it is not possible unless you modify the source code. > > Also, I suspect that this would be non-trivial. I haven't looked to closely at how the MLPClassifier/MLPRegressor are implemented but since you perform the weight updates based on the gradient of the cost function wrt the weights, the modification would be non-trivial if the partial derivatives are not computed based on some autodiff implementation -- you would have to edit all the partial d's along the backpropagation up to the first hidden layer. While I think that scikit-learn is by far the best library out there for machine learning, I think if you want an easy solution, you probably won't get around TensorFlow or PyTorch or equivalent, here, for your specific MLP problem unless you want to make your life extra hard :P (seriously, you can pick up any of the two in about an hour and have your MLPRegressor up and running so that you can then experiment with your cost function). > > Best, > Sebastian > > > On Sep 11, 2017, at 6:13 PM, Thomas Evangelidis wrote: > > > > Greetings, > > > > I know this is a recurrent question, but I would like to use my own loss function either in a MLPRegressor or in an SVR. 
For the MLPRegressor case so far my conclusion was that it is not possible unless you modify the source code. On the other hand, for the SVR I was looking at setting custom kernel functions. But I am not sure if this is the same thing. Could someone please clarify this to me? Finally, I read about the "scoring" parameter is cross-validation, but this is just to select a Regressor that has been trained already with the default loss function, so it would be harder to find one that minimizes my own loss function. > > > > For the record, my loss function is the centered root mean square error. > > > > Thanks in advance for any advice. > > > > > > > > -- > > ====================================================================== > > Dr Thomas Evangelidis > > Post-doctoral Researcher > > CEITEC - Central European Institute of Technology > > Masaryk University > > Kamenice 5/A35/2S049, > > 62500 Brno, Czech Republic > > > > email: tevang at pharm.uoa.gr > > tevang3 at gmail.com > > > > website: https://sites.google.com/site/thomasevangelidishomepage/ > > > > > > _______________________________________________ > > scikit-learn mailing list > > scikit-learn at python.org > > https://mail.python.org/mailman/listinfo/scikit-learn > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > > > -- > ====================================================================== > Dr Thomas Evangelidis > Post-doctoral Researcher > CEITEC - Central European Institute of Technology > Masaryk University > Kamenice 5/A35/2S049, > 62500 Brno, Czech Republic > > email: tevang at pharm.uoa.gr > tevang3 at gmail.com > > website: https://sites.google.com/site/thomasevangelidishomepage/ > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn From tevang3 at gmail.com Wed Sep 13 
13:18:39 2017 From: tevang3 at gmail.com (Thomas Evangelidis) Date: Wed, 13 Sep 2017 19:18:39 +0200 Subject: [scikit-learn] custom loss function In-Reply-To: References: Message-ID: Thanks again for the clarifications Sebastian! Keras has a Scikit-learn API with the KerasRegressor which implements the Scikit-Learn MLPRegressor interface: https://keras.io/scikit-learn-api/ Is it possible to change the loss function in KerasRegressor? I don't have time right now to experiment with hyperparameters of new ANN architectures. I am in urgent need to reproduce in Keras the results obtained with MLPRegressor and the set of hyperparameters that I have optimized for my problem and later change the loss function. On 13 September 2017 at 18:14, Sebastian Raschka wrote: > > What about the SVR? Is it possible to change the loss function there? > > Here you would have the same problem; SVR is a constrained optimization > problem and you would have to change the calculation of the loss gradient > then. Since SVR is a "1-layer" neural net, if you change the cost function > to something else, it's not really a SVR anymore. > > > > Could you please clarify what the "x" and "x'" parameters in the default > Kernel functions mean? Is "x" a NxM array, where N is the number of > observations and M the number of features? > > Both x and x' should be denoting training examples. The kernel matrix is > symmetric (N x N). > > > > Best, > Sebastian > > > On Sep 13, 2017, at 5:25 AM, Thomas Evangelidis > wrote: > > > > Thanks Sebastian. Exploring Tensorflow capabilities was in my TODO list, > but now it's in my immediate plans. > > What about the SVR? Is it possible to change the loss function there? > Could you please clarify what the "x" and "x'" parameters in the default > Kernel functions mean? Is "x" a NxM array, where N is the number of > observations and M the number of features?
> > > > http://scikit-learn.org/stable/modules/svm.html#kernel-functions > > > > > > > > On 12 September 2017 at 00:37, Sebastian Raschka > wrote: > > Hi Thomas, > > > > > For the MLPRegressor case so far my conclusion was that it is not > possible unless you modify the source code. > > > > Also, I suspect that this would be non-trivial. I haven't looked to > closely at how the MLPClassifier/MLPRegressor are implemented but since you > perform the weight updates based on the gradient of the cost function wrt > the weights, the modification would be non-trivial if the partial > derivatives are not computed based on some autodiff implementation -- you > would have to edit all the partial d's along the backpropagation up to the > first hidden layer. While I think that scikit-learn is by far the best > library out there for machine learning, I think if you want an easy > solution, you probably won't get around TensorFlow or PyTorch or > equivalent, here, for your specific MLP problem unless you want to make > your life extra hard :P (seriously, you can pick up any of the two in about > an hour and have your MLPRegressor up and running so that you can then > experiment with your cost function). > > > > Best, > > Sebastian > > > > > On Sep 11, 2017, at 6:13 PM, Thomas Evangelidis > wrote: > > > > > > Greetings, > > > > > > I know this is a recurrent question, but I would like to use my own > loss function either in a MLPRegressor or in an SVR. For the MLPRegressor > case so far my conclusion was that it is not possible unless you modify the > source code. On the other hand, for the SVR I was looking at setting custom > kernel functions. But I am not sure if this is the same thing. Could > someone please clarify this to me? Finally, I read about the "scoring" > parameter is cross-validation, but this is just to select a Regressor that > has been trained already with the default loss function, so it would be > harder to find one that minimizes my own loss function. 
> > > > > > For the record, my loss function is the centered root mean square > error. > > > > > > Thanks in advance for any advice. > > > > > > > > > > > > -- > > > ====================================================================== > > > Dr Thomas Evangelidis > > > Post-doctoral Researcher > > > CEITEC - Central European Institute of Technology > > > Masaryk University > > > Kamenice 5/A35/2S049, > > > 62500 Brno, Czech Republic > > > > > > email: tevang at pharm.uoa.gr > > > tevang3 at gmail.com > > > > > > website: https://sites.google.com/site/thomasevangelidishomepage/ > > > > > > > > > _______________________________________________ > > > scikit-learn mailing list > > > scikit-learn at python.org > > > https://mail.python.org/mailman/listinfo/scikit-learn > > > > _______________________________________________ > > scikit-learn mailing list > > scikit-learn at python.org > > https://mail.python.org/mailman/listinfo/scikit-learn > > > > > > > > -- > > ====================================================================== > > Dr Thomas Evangelidis > > Post-doctoral Researcher > > CEITEC - Central European Institute of Technology > > Masaryk University > > Kamenice 5/A35/2S049, > > 62500 Brno, Czech Republic > > > > email: tevang at pharm.uoa.gr > > tevang3 at gmail.com > > > > website: https://sites.google.com/site/thomasevangelidishomepage/ > > > > > > _______________________________________________ > > scikit-learn mailing list > > scikit-learn at python.org > > https://mail.python.org/mailman/listinfo/scikit-learn > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -- ====================================================================== Dr Thomas Evangelidis Post-doctoral Researcher CEITEC - Central European Institute of Technology Masaryk University Kamenice 5/A35/2S049, 62500 Brno, Czech Republic email: tevang at pharm.uoa.gr tevang3 at 
gmail.com website: https://sites.google.com/site/thomasevangelidishomepage/ -------------- next part -------------- An HTML attachment was scrubbed... URL: From se.raschka at gmail.com Wed Sep 13 13:49:26 2017 From: se.raschka at gmail.com (Sebastian Raschka) Date: Wed, 13 Sep 2017 13:49:26 -0400 Subject: [scikit-learn] custom loss function In-Reply-To: References: Message-ID: <0F4AAD19-D096-492B-B0BE-346DC231F365@gmail.com> > Is it possible to change the loss function in KerasRegressor? I don't have time right now to experiment with hyperparameters of new ANN architectures. I am in urgent need to reproduce in Keras the results obtained with MLPRegressor and the set of hyperparameters that I have optimized for my problem and later change the loss function Honestly, I don't have much experience with Keras. It may be easy to do that, I don't know. Alternatively, defining an MLP regressor in TensorFlow is not that hard and takes only a few lines of code. E.g., you could copy the mlp classifier from (cell 4) here: https://github.com/rasbt/deep-learning-book/blob/master/code/model_zoo/multilayer-perceptron-lowlevel.ipynb just delete the last two ops in the output layer out_act = tf.nn.softmax(out_z, name='predicted_probabilities') out_labels = tf.argmax(out_z, axis=1, name='predicted_labels') and replace the loss/cost by tf.losses.mean_squared_error and you should have an MLP regressor running in a few lines of code. Then, you could experiment with your loss function by defining your own function. E.g., the usage is quite similar to what you do in NumPy; the mean_squared_error above can be manually defined as, e.g., cost = tf.reduce_sum(tf.pow(pred - y, 2))/(2*n_samples) Best, Sebastian > On Sep 13, 2017, at 1:18 PM, Thomas Evangelidis wrote: > > Thanks again for the clarifications Sebastian!
> > Keras has a Scikit-learn API with the KeraRegressor which implements the Scikit-Learn MLPRegressor interface: > > https://keras.io/scikit-learn-api/ > > Is it possible to change the loss function in KerasRegressor? I don't have time right now to experiment with hyperparameters of new ANN architectures. I am in urgent need to reproduce in Keras the results obtained with MLPRegressor and the set of hyperparameters that I have optimized for my problem and later change the loss function. > > > > On 13 September 2017 at 18:14, Sebastian Raschka wrote: > > What about the SVR? Is it possible to change the loss function there? > > Here you would have the same problem; SVR is a constrained optimization problem and you would have to change the calculation of the loss gradient then. Since SVR is a "1-layer" neural net, if you change the cost function to something else, it's not really a SVR anymore. > > > > Could you please clarify what the "x" and "x'" parameters in the default Kernel functions mean? Is "x" a NxM array, where N is the number of observations and M the number of features? > > Both x and x' should be denoting training examples. The kernel matrix is symmetric (N x N). > > > > Best, > Sebastian > > > On Sep 13, 2017, at 5:25 AM, Thomas Evangelidis wrote: > > > > Thanks Sebastian. Exploring Tensorflow capabilities was in my TODO list, but now it's in my immediate plans. > > What about the SVR? Is it possible to change the loss function there? Could you please clarify what the "x" and "x'" parameters in the default Kernel functions mean? Is "x" a NxM array, where N is the number of observations and M the number of features? > > > > http://scikit-learn.org/stable/modules/svm.html#kernel-functions > > > > > > > > On 12 September 2017 at 00:37, Sebastian Raschka wrote: > > Hi Thomas, > > > > > For the MLPRegressor case so far my conclusion was that it is not possible unless you modify the source code. > > > > Also, I suspect that this would be non-trivial. 
I haven't looked to closely at how the MLPClassifier/MLPRegressor are implemented but since you perform the weight updates based on the gradient of the cost function wrt the weights, the modification would be non-trivial if the partial derivatives are not computed based on some autodiff implementation -- you would have to edit all the partial d's along the backpropagation up to the first hidden layer. While I think that scikit-learn is by far the best library out there for machine learning, I think if you want an easy solution, you probably won't get around TensorFlow or PyTorch or equivalent, here, for your specific MLP problem unless you want to make your life extra hard :P (seriously, you can pick up any of the two in about an hour and have your MLPRegressor up and running so that you can then experiment with your cost function). > > > > Best, > > Sebastian > > > > > On Sep 11, 2017, at 6:13 PM, Thomas Evangelidis wrote: > > > > > > Greetings, > > > > > > I know this is a recurrent question, but I would like to use my own loss function either in a MLPRegressor or in an SVR. For the MLPRegressor case so far my conclusion was that it is not possible unless you modify the source code. On the other hand, for the SVR I was looking at setting custom kernel functions. But I am not sure if this is the same thing. Could someone please clarify this to me? Finally, I read about the "scoring" parameter is cross-validation, but this is just to select a Regressor that has been trained already with the default loss function, so it would be harder to find one that minimizes my own loss function. > > > > > > For the record, my loss function is the centered root mean square error. > > > > > > Thanks in advance for any advice. 
> > > > > > > > > > > > -- > > > ====================================================================== > > > Dr Thomas Evangelidis > > > Post-doctoral Researcher > > > CEITEC - Central European Institute of Technology > > > Masaryk University > > > Kamenice 5/A35/2S049, > > > 62500 Brno, Czech Republic > > > > > > email: tevang at pharm.uoa.gr > > > tevang3 at gmail.com > > > > > > website: https://sites.google.com/site/thomasevangelidishomepage/ > > > > > > > > > _______________________________________________ > > > scikit-learn mailing list > > > scikit-learn at python.org > > > https://mail.python.org/mailman/listinfo/scikit-learn > > > > _______________________________________________ > > scikit-learn mailing list > > scikit-learn at python.org > > https://mail.python.org/mailman/listinfo/scikit-learn > > > > > > > > -- > > ====================================================================== > > Dr Thomas Evangelidis > > Post-doctoral Researcher > > CEITEC - Central European Institute of Technology > > Masaryk University > > Kamenice 5/A35/2S049, > > 62500 Brno, Czech Republic > > > > email: tevang at pharm.uoa.gr > > tevang3 at gmail.com > > > > website: https://sites.google.com/site/thomasevangelidishomepage/ > > > > > > _______________________________________________ > > scikit-learn mailing list > > scikit-learn at python.org > > https://mail.python.org/mailman/listinfo/scikit-learn > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > > > -- > ====================================================================== > Dr Thomas Evangelidis > Post-doctoral Researcher > CEITEC - Central European Institute of Technology > Masaryk University > Kamenice 5/A35/2S049, > 62500 Brno, Czech Republic > > email: tevang at pharm.uoa.gr > tevang3 at gmail.com > > website: https://sites.google.com/site/thomasevangelidishomepage/ > > > 
_______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn From t3kcit at gmail.com Wed Sep 13 14:45:41 2017 From: t3kcit at gmail.com (Andreas Mueller) Date: Wed, 13 Sep 2017 14:45:41 -0400 Subject: [scikit-learn] Terminating a Pipeline with a NearestNeighbors search In-Reply-To: References: Message-ID: <7cf5de8b-f8de-53e8-5988-77e374f35d14@gmail.com> Hi Ryan. I don't think there's a good solution. Feel free to open an issue in the issue tracker (I'm not aware of one for this). You can access the pipeline steps, so you can access the kneighbors method via the "steps" attribute, but that wouldn't take any of the previous steps into account, and so you lose all the benefits of the pipeline. We could add a way to call non-standard methods, but I'm not sure that is the right way to go. (like pipeline.custom_method(X, method="kneighbors")). But that assumes that the method signature is X or (X, y). So I'm not sure if this is generally useful. Andy On 09/12/2017 04:01 PM, Ryan Conway wrote: > Hello, > > I'm wondering if sklearn provides a means of terminating pipelines > with a NearestNeighbors search. > > For example, my workflow is DictVectorizer -> TfidfTransformer -> > NearestNeighbors. I'd like to capture this in an sklearn Pipeline. > Unfortunately, Pipeline does not expose a kneighbors() method that > would run all intermediate transforms and then return the result of > NearestNeighbors.kneighbors(). > > I went through Pipeline's source and noticed its decision_function(), > predict() etc. all capture this functionality with different > terminating operation names. Maybe there is some way to specify the > terminating operation method name rather than relying on these > Pipeline methods? 
> > Thank you, > Ryan > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn -------------- next part -------------- An HTML attachment was scrubbed... URL: From t3kcit at gmail.com Wed Sep 13 14:46:50 2017 From: t3kcit at gmail.com (Andreas Mueller) Date: Wed, 13 Sep 2017 14:46:50 -0400 Subject: [scikit-learn] custom loss function In-Reply-To: References: Message-ID: <7e438aef-13b0-fb2d-fa7e-fd8e8db587dd@gmail.com> On 09/13/2017 01:18 PM, Thomas Evangelidis wrote: > ?? > Thanks again for the clarifications Sebastian! > > Keras has a Scikit-learn API with the KeraRegressor which implements > the Scikit-Learn MLPRegressor interface: > > https://keras.io/scikit-learn-api/ > > Is it possible to change the loss function in KerasRegressor? I don't > have time right now to experiment with hyperparameters of new ANN > architectures. I am in urgent need to reproduce in Keras the results > obtained with MLPRegressor and the set of hyperparameters that I have > optimized for my problem and later change the loss function. > I think using keras is probably the way to go for you. -------------- next part -------------- An HTML attachment was scrubbed... URL: From gael.varoquaux at normalesup.org Wed Sep 13 14:53:47 2017 From: gael.varoquaux at normalesup.org (Gael Varoquaux) Date: Wed, 13 Sep 2017 20:53:47 +0200 Subject: [scikit-learn] Terminating a Pipeline with a NearestNeighbors search In-Reply-To: <7cf5de8b-f8de-53e8-5988-77e374f35d14@gmail.com> References: <7cf5de8b-f8de-53e8-5988-77e374f35d14@gmail.com> Message-ID: <20170913185347.GE4066674@phare.normalesup.org> On Wed, Sep 13, 2017 at 02:45:41PM -0400, Andreas Mueller wrote: > We could add a way to call non-standard methods, but I'm not sure that is the > right way to go. > (like pipeline.custom_method(X, method="kneighbors")). But that assumes that > the method signature is X or (X, y). 
> So I'm not sure if this is generally useful.

I don't see why it's useful either. We shouldn't add a method for
everything that can be easily coded with a few lines of Python. The nice
thing about Python is that it is such an expressive language.

Gaël

From joel.nothman at gmail.com Wed Sep 13 17:14:49 2017
From: joel.nothman at gmail.com (Joel Nothman)
Date: Thu, 14 Sep 2017 07:14:49 +1000
Subject: [scikit-learn] Terminating a Pipeline with a NearestNeighbors search
In-Reply-To: <20170913185347.GE4066674@phare.normalesup.org>
References: <7cf5de8b-f8de-53e8-5988-77e374f35d14@gmail.com> <20170913185347.GE4066674@phare.normalesup.org>
Message-ID:

It's pretty easy to implement this by creating your own Pipeline subclass, isn't it?

On 14 Sep 2017 4:55 am, "Gael Varoquaux" wrote:
> On Wed, Sep 13, 2017 at 02:45:41PM -0400, Andreas Mueller wrote:
> > We could add a way to call non-standard methods, but I'm not sure that
> > is the right way to go.
> > (like pipeline.custom_method(X, method="kneighbors")). But that assumes
> > that the method signature is X or (X, y).
> > So I'm not sure if this is generally useful.
>
> I don't see why it's useful either. We shouldn't add a method for
> everything that can be easily coded with a few lines of Python. The nice
> thing about Python is that it is such an expressive language.
>
> Gaël
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From tevang3 at gmail.com Wed Sep 13 17:31:04 2017
From: tevang3 at gmail.com (Thomas Evangelidis)
Date: Wed, 13 Sep 2017 23:31:04 +0200
Subject: [scikit-learn] custom loss function
In-Reply-To: <7e438aef-13b0-fb2d-fa7e-fd8e8db587dd@gmail.com>
References: <7e438aef-13b0-fb2d-fa7e-fd8e8db587dd@gmail.com>
Message-ID:

What about the SVM?
I use an SVR at the end to combine multiple
MLPRegressor predictions using the rbf kernel (a linear kernel is not good
for this problem). Can I also implement an SVR with an rbf kernel in
Tensorflow using my own loss function? So far I have found an example of an
SVC with a linear kernel in Tensorflow and nothing in Keras. My alternative
option would be to train multiple SVRs and find through cross-validation
the one that minimizes my custom loss function, but as I said in a previous
message, that would be a suboptimal solution because in scikit-learn the
SVR minimizes the default loss function.

On 13 Sep 2017 at 20:48, "Andreas Mueller" wrote:
>
> On 09/13/2017 01:18 PM, Thomas Evangelidis wrote:
>
> Thanks again for the clarifications Sebastian!
>
> Keras has a Scikit-learn API with the KerasRegressor which implements the
> Scikit-Learn MLPRegressor interface:
>
> https://keras.io/scikit-learn-api/
>
> Is it possible to change the loss function in KerasRegressor? I don't have
> time right now to experiment with hyperparameters of new ANN architectures.
> I am in urgent need to reproduce in Keras the results obtained with
> MLPRegressor and the set of hyperparameters that I have optimized for my
> problem and later change the loss function.
>
> I think using keras is probably the way to go for you.
>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From vaggi.federico at gmail.com Wed Sep 13 17:51:40 2017
From: vaggi.federico at gmail.com (federico vaggi)
Date: Wed, 13 Sep 2017 21:51:40 +0000
Subject: [scikit-learn] custom loss function
In-Reply-To: References: <7e438aef-13b0-fb2d-fa7e-fd8e8db587dd@gmail.com>
Message-ID:

You are confusing the kernel with the loss function.
SVMs minimize a well-defined hinge loss on a space that's implicitly
defined by a kernel mapping (or in feature space, if you use a linear kernel).

On Wed, 13 Sep 2017 at 14:31 Thomas Evangelidis wrote:
> What about the SVM? I use an SVR at the end to combine multiple
> MLPRegressor predictions using the rbf kernel (linear kernel is not good
> for this problem). Can I also implement an SVR with rbf kernel in
> Tensorflow using my own loss function? So far I found an example of an SVC
> with linear kernel in Tensorflow and nothing in Keras. My alternative
> option would be to train multiple SVRs and find through cross validation
> the one that minimizes my custom loss function, but as I said in a previous
> message, that would be a suboptimal solution because in scikit-learn the
> SVR minimizes the default loss function.
>
> On 13 Sep 2017 at 20:48, "Andreas Mueller" wrote:
>
>> On 09/13/2017 01:18 PM, Thomas Evangelidis wrote:
>>
>> Thanks again for the clarifications Sebastian!
>>
>> Keras has a Scikit-learn API with the KerasRegressor which implements the
>> Scikit-Learn MLPRegressor interface:
>>
>> https://keras.io/scikit-learn-api/
>>
>> Is it possible to change the loss function in KerasRegressor? I don't
>> have time right now to experiment with hyperparameters of new ANN
>> architectures. I am in urgent need to reproduce in Keras the results
>> obtained with MLPRegressor and the set of hyperparameters that I have
>> optimized for my problem and later change the loss function.
>>
>> I think using keras is probably the way to go for you.
>>
>> _______________________________________________
>> scikit-learn mailing list
>> scikit-learn at python.org
>> https://mail.python.org/mailman/listinfo/scikit-learn
>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From tevang3 at gmail.com Wed Sep 13 18:20:32 2017
From: tevang3 at gmail.com (Thomas Evangelidis)
Date: Thu, 14 Sep 2017 00:20:32 +0200
Subject: [scikit-learn] custom loss function
In-Reply-To: References: <7e438aef-13b0-fb2d-fa7e-fd8e8db587dd@gmail.com>
Message-ID:

I said that I want to make a Support Vector Regressor using the rbf kernel
to minimize my own loss function. I never mentioned classification or the
hinge loss.

On 13 September 2017 at 23:51, federico vaggi wrote:
> You are confusing the kernel with the loss function. SVMs minimize a
> well-defined hinge loss on a space that's implicitly defined by a kernel
> mapping (or, in feature space if you use a linear kernel).
>
> On Wed, 13 Sep 2017 at 14:31 Thomas Evangelidis wrote:
>
>> What about the SVM? I use an SVR at the end to combine multiple
>> MLPRegressor predictions using the rbf kernel (linear kernel is not good
>> for this problem). Can I also implement an SVR with rbf kernel in
>> Tensorflow using my own loss function? So far I found an example of an SVC
>> with linear kernel in Tensorflow and nothing in Keras. My alternative
>> option would be to train multiple SVRs and find through cross validation
>> the one that minimizes my custom loss function, but as I said in a previous
>> message, that would be a suboptimal solution because in scikit-learn the
>> SVR minimizes the default loss function.
>>
>> On 13 Sep 2017 at 20:48, "Andreas Mueller" wrote:
>>
>>> On 09/13/2017 01:18 PM, Thomas Evangelidis wrote:
>>>
>>> Thanks again for the clarifications Sebastian!
>>>
>>> Keras has a Scikit-learn API with the KerasRegressor which implements the
>>> Scikit-Learn MLPRegressor interface:
>>>
>>> https://keras.io/scikit-learn-api/
>>>
>>> Is it possible to change the loss function in KerasRegressor? I don't
>>> have time right now to experiment with hyperparameters of new ANN
>>> architectures. I am in urgent need to reproduce in Keras the results
>>> obtained with MLPRegressor and the set of hyperparameters that I have
>>> optimized for my problem and later change the loss function.
>>>
>>> I think using keras is probably the way to go for you.
>>>
>>> _______________________________________________
>>> scikit-learn mailing list
>>> scikit-learn at python.org
>>> https://mail.python.org/mailman/listinfo/scikit-learn
>>
>> _______________________________________________
>> scikit-learn mailing list
>> scikit-learn at python.org
>> https://mail.python.org/mailman/listinfo/scikit-learn
>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn

--
======================================================================
Dr Thomas Evangelidis
Post-doctoral Researcher
CEITEC - Central European Institute of Technology
Masaryk University
Kamenice 5/A35/2S049,
62500 Brno, Czech Republic
email: tevang at pharm.uoa.gr
       tevang3 at gmail.com
website: https://sites.google.com/site/thomasevangelidishomepage/

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From vaggi.federico at gmail.com Wed Sep 13 18:25:26 2017
From: vaggi.federico at gmail.com (federico vaggi)
Date: Wed, 13 Sep 2017 22:25:26 +0000
Subject: [scikit-learn] custom loss function
In-Reply-To: References: <7e438aef-13b0-fb2d-fa7e-fd8e8db587dd@gmail.com>
Message-ID:

My bad, I looked at your question in the context of your 2nd e-mail in this
topic, where you talked about custom loss functions and SVR.
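[Editorial note: the custom-loss idea discussed in this thread can be sketched without any framework: choose a differentiable loss and fit the model parameters by gradient descent on that loss directly, instead of selecting among models trained on a fixed default loss. The following is a minimal pure-Python sketch under stated assumptions — it is not sklearn's SVR (whose epsilon-insensitive loss is fixed) and not the Keras API; the asymmetric loss, the 1-D linear model, and all names here are illustrative inventions.]

```python
# Sketch: fit a 1-D linear model y = w*x + b by gradient descent on a
# hypothetical custom loss that penalizes under-prediction more heavily
# than over-prediction (an asymmetric squared error).

def custom_loss(y_true, y_pred, under_weight=3.0):
    """Mean asymmetric squared error: residual > 0 (under-prediction)
    is weighted `under_weight` times more than residual < 0."""
    total = 0.0
    for t, p in zip(y_true, y_pred):
        r = t - p
        total += (under_weight if r > 0 else 1.0) * r * r
    return total / len(y_true)

def fit_linear(xs, ys, lr=0.01, steps=2000, under_weight=3.0):
    """Plain gradient descent on custom_loss for the model w*x + b."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        gw = gb = 0.0
        for x, t in zip(xs, ys):
            r = t - (w * x + b)
            scale = under_weight if r > 0 else 1.0
            # d/dw of scale*r^2 is -2*scale*r*x; d/db is -2*scale*r
            gw += -2.0 * scale * r * x / n
            gb += -2.0 * scale * r / n
        w -= lr * gw
        b -= lr * gb
    return w, b

xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]   # exactly y = 2x + 1
w, b = fit_linear(xs, ys)
```

With a framework such as Keras, the same idea is expressed by passing a custom loss callable to model.compile, and the optimizer differentiates it automatically; the point of the sketch is only that "train under my loss" is a gradient-descent statement, not a cross-validation one.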
On Wed, 13 Sep 2017 at 15:20 Thomas Evangelidis wrote:
> I said that I want to make a Support Vector Regressor using the rbf kernel
> to minimize my own loss function. Never mentioned about classification and
> hinge loss.
>
> On 13 September 2017 at 23:51, federico vaggi wrote:
>
>> You are confusing the kernel with the loss function. SVM minimize a well
>> defined hinge loss on a space that's implicitly defined by a kernel mapping
>> (or, in feature space if you use a linear kernel).
>>
>> On Wed, 13 Sep 2017 at 14:31 Thomas Evangelidis wrote:
>>
>>> What about the SVM? I use an SVR at the end to combine multiple
>>> MLPRegressor predictions using the rbf kernel (linear kernel is not good
>>> for this problem). Can I also implement an SVR with rbf kernel in
>>> Tensorflow using my own loss function? So far I found an example of an SVC
>>> with linear kernel in Tensorflow and nothing in Keras. My alternative
>>> option would be to train multiple SVRs and find through cross validation
>>> the one that minimizes my custom loss function, but as I said in a previous
>>> message, that would be a suboptimal solution because in scikit-learn the
>>> SVR minimizes the default loss function.
>>>
>>> On 13 Sep 2017 at 20:48, "Andreas Mueller" <t3kcit at gmail.com> wrote:
>>>
>>>> On 09/13/2017 01:18 PM, Thomas Evangelidis wrote:
>>>>
>>>> Thanks again for the clarifications Sebastian!
>>>>
>>>> Keras has a Scikit-learn API with the KerasRegressor which implements
>>>> the Scikit-Learn MLPRegressor interface:
>>>>
>>>> https://keras.io/scikit-learn-api/
>>>>
>>>> Is it possible to change the loss function in KerasRegressor? I don't
>>>> have time right now to experiment with hyperparameters of new ANN
>>>> architectures. I am in urgent need to reproduce in Keras the results
>>>> obtained with MLPRegressor and the set of hyperparameters that I have
>>>> optimized for my problem and later change the loss function.
>>>>
>>>> I think using keras is probably the way to go for you.
>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From s.atasever at gmail.com Thu Sep 14 08:51:34 2017
From: s.atasever at gmail.com (Sema Atasever)
Date: Thu, 14 Sep 2017 15:51:34 +0300
Subject: [scikit-learn] Accessing Clustering Feature Tree in Birch
Message-ID:

Dear scikit-learn members,

I have written about this subject before, but my question is still not
completely solved.

- How can I *access the Clustering Feature Tree* in Birch?
- For example, how many clusters are there in the hierarchy under the
  *root node*, and what are the data samples in these clusters?
- Can I get them separately for 3 trees?

Best.
*Birch Implementation Code:*

from sklearn.cluster import Birch
import numpy as np
import matplotlib.pyplot as plt

X = np.loadtxt(r"C:\dataset.txt", delimiter=";")

brc = Birch(branching_factor=50, n_clusters=None, threshold=0.5,
            compute_labels=True, copy=True)
brc.fit(X)

birch_predict = brc.predict(X)
print("\nClustering_result:\n")
print(birch_predict)
np.savetxt('birch_predict_CLASS_0.csv', birch_predict, fmt="%i", delimiter=',')

myroot = brc.root_
centroids = brc.subcluster_centers_

plt.plot(X[:, 0], X[:, 1], '+')
plt.plot(centroids[:, 0], centroids[:, 1], 'o')
plt.show()

labels = brc.subcluster_labels_
n_clusters = np.unique(labels).size
print("n_clusters : %d" % n_clusters + "\n")

-------------- next part --------------
An HTML attachment was scrubbed...
URL:
-------------- next part --------------
(dataset.txt attachment: semicolon-delimited numeric feature rows, not reproduced here)
0.008672;0.99994;0.952639;0.000037;0;0.047426;0.613963;0.941544;0.000053;0.058401;0.99978;0;0.000215;0.999795;0;0.000205;0.953784;0.033575;0;0.99645;0.99539;0.00001 0.000037;0.999595;0.941544;0.000053;0;0.119203;0.529715;0.953784;0.033575;0.012639;0.999345;0.00004;0.00061;0.99894;0;0.00106;0.950305;0.037169;0.00001;0.992425;0.988895;0.00003 0.000053;0.999795;0.953784;0.033575;0;0.731059;0.488502;0.950305;0.037169;0.012526;0.99645;0.00008;0.003465;0.99539;0.00001;0.004605;0.953128;0.015558;0.00003;0.96765;0.981705;0.00012 0.033575;0.99894;0.950305;0.037169;0.00001;0.5;0.426536;0.953128;0.015558;0.031314;0.992425;0.000085;0.00749;0.988895;0.00003;0.011075;0.933835;0.016748;0.00012;0.82843;0.899765;0.000305 0.037169;0.99539;0.953128;0.015558;0.00003;0.119203;0.554285;0.933835;0.016748;0.049415;0.96765;0.00009;0.032255;0.981705;0.00012;0.018175;0.835794;0.015184;0.000305;0.151705;0.462245;0.000715 0.015558;0.988895;0.933835;0.016748;0.00012;0.5;0.60444;0.835794;0.015184;0.149021;0.82843;0.000105;0.171465;0.899765;0.000305;0.099925;0.312008;0.008636;0.000715;0.07423;0.025605;0.00077 0.016792;0.083015;0.117144;0.078285;0.04578;0.119203;0.546614;0.248611;0.003692;0.7477;0.21343;0.00002;0.786555;0.33779;0.00833;0.653885;0.393686;0.005201;0.00746;0.550185;0.461935;0.003405 0.078285;0.12086;0.248611;0.003692;0.00833;0.119203;0.455617;0.393686;0.005201;0.601113;0.38517;0;0.61483;0.4509;0.00746;0.54164;0.432389;0.001162;0.003405;0.888935;0.87477;0.00035 0.003692;0.33779;0.393686;0.005201;0.00746;0.119203;0.231831;0.432389;0.001162;0.566446;0.550185;0.00001;0.4498;0.461935;0.003405;0.534655;0.841113;0.000142;0.00035;0.97981;0.92331;0.00007 0.005201;0.4509;0.432389;0.001162;0.003405;0.5;0.280497;0.841113;0.000142;0.158745;0.888935;0.00001;0.111055;0.87477;0.00035;0.12488;0.900356;0.000046;0.00007;0.990275;0.97047;0.00004 
0.001162;0.461935;0.841113;0.000142;0.00035;0.119203;0.523732;0.900356;0.000046;0.099599;0.97981;0.000005;0.020185;0.92331;0.00007;0.076625;0.968005;0.001144;0.00004;0.99949;0.99538;0.000115 0.000142;0.87477;0.900356;0.000046;0.00007;0.268941;0.439301;0.968005;0.001144;0.030853;0.990275;0.000005;0.009725;0.97047;0.00004;0.02949;0.993907;0.001241;0.000115;0.99965;0.996815;0.00004 0.000046;0.92331;0.968005;0.001144;0.00004;0.880797;0.350692;0.993907;0.001241;0.004852;0.99949;0;0.00051;0.99538;0.000115;0.004505;0.988958;0.001281;0.00004;0.99971;0.995905;0.000015 0.001144;0.97047;0.993907;0.001241;0.000115;0.5;0.329599;0.988958;0.001281;0.009761;0.99965;0;0.00035;0.996815;0.00004;0.003145;0.985293;0.000038;0.000015;0.999985;0.99454;0.00001 0.001241;0.99538;0.988958;0.001281;0.00004;0.119203;0.452642;0.985293;0.000038;0.014667;0.99971;0;0.000285;0.995905;0.000015;0.00408;0.979348;0.005637;0.00001;0.99995;0.995545;0.00001 0.001281;0.996815;0.985293;0.000038;0.000015;0.5;0.606828;0.979348;0.005637;0.015015;0.999985;0;0.000015;0.99454;0.00001;0.00545;0.994323;0.003886;0.00001;0.99995;0.99819;0 0.000038;0.995905;0.979348;0.005637;0.00001;0.5;0.719302;0.994323;0.003886;0.001791;0.99995;0;0.00005;0.995545;0.00001;0.004445;0.993189;0.000233;0;0.999355;0.999255;0 0.005637;0.99454;0.994323;0.003886;0.00001;0.268941;0.179167;0.993189;0.000233;0.006578;0.99995;0;0.00005;0.99819;0;0.00181;0.999115;0.000211;0;0.99927;0.99949;0 0.003886;0.995545;0.993189;0.000233;0;0.119203;0.67787;0.999115;0.000211;0.000674;0.999355;0;0.000645;0.999255;0;0.000745;0.998113;0.000171;0;0.99831;0.99966;0 0.000233;0.99819;0.999115;0.000211;0;0.731059;0;0.998113;0.000171;0.001716;0.99927;0.000005;0.000725;0.99949;0;0.00051;0.998331;0.000116;0;0.991755;0.99985;0 0.000211;0.999255;0.998113;0.000171;0;0.047426;0.286182;0.998331;0.000116;0.001553;0.99831;0.000005;0.001685;0.99966;0;0.00034;0.996304;0.000106;0;0.9818;0.99917;0 
0.000171;0.99949;0.998331;0.000116;0;0.268941;0.441519;0.996304;0.000106;0.003589;0.991755;0.00001;0.008235;0.99985;0;0.00015;0.984398;0.000236;0;0.8794;0.99648;0.000005 0.000116;0.99966;0.996304;0.000106;0;0.119203;0.457354;0.984398;0.000236;0.015367;0.9818;0.000575;0.01763;0.99917;0;0.00083;0.94823;0.00062;0.000005;0.83004;0.981515;0.000025 0.000236;0.99917;0.94823;0.00062;0.000005;0.5;0.315831;0.916663;0.000691;0.082648;0.83004;0.002015;0.167945;0.981515;0.000025;0.018465;0.849844;0.000841;0.000045;0.60209;0.93436;0.009335 0.00062;0.99648;0.916663;0.000691;0.000025;0.119203;0.512497;0.849844;0.000841;0.149313;0.69651;0.002445;0.301045;0.943615;0.000045;0.056335;0.784729;0.00706;0.009335;0.550905;0.788395;0.01519 0.000691;0.981515;0.849844;0.000841;0.000045;0.119203;0.607305;0.784729;0.00706;0.208215;0.60209;0.01181;0.386105;0.93436;0.009335;0.05631;0.697286;0.014475;0.01519;0.00835;0.10455;0.00141 0.672462;0.005635;0.048684;0.372434;0.79551;0.5;0.023823;0.941118;0.006192;0.052691;0.9675;0.000015;0.032485;0.983405;0.00004;0.01656;0.955736;0.008144;0.000005;0.999815;0.999065;0 0.372434;0.006595;0.941118;0.006192;0.00004;0.047426;0.444974;0.955736;0.008144;0.036119;0.99976;0;0.000235;0.99842;0.000005;0.001575;0.976851;0.000012;0;0.99996;0.9999;0 0.006192;0.983405;0.955736;0.008144;0.000005;0.017986;0.199248;0.976851;0.000012;0.023137;0.999815;0.000005;0.00018;0.999065;0;0.000935;0.983451;0.000012;0;0.999985;0.99999;0 0.008144;0.99842;0.976851;0.000012;0;0.119203;0.057487;0.983451;0.000012;0.016538;0.99996;0.000005;0.000035;0.9999;0;0.0001;0.982751;0.000012;0;0.999985;0.99985;0 0.000012;0.999065;0.983451;0.000012;0;0.952574;0.597967;0.982751;0.000012;0.017236;0.999985;0.000005;0.000005;0.99999;0;0.00001;0.975726;0.000529;0;0.999965;0.999845;0 0.000012;0.9999;0.982751;0.000012;0;0.047426;0.574443;0.975726;0.000529;0.023745;0.999985;0.000005;0.00001;0.99985;0;0.00015;0.972878;0.000503;0;0.999435;0.999335;0 
0.000012;0.99999;0.975726;0.000529;0;0.5;0.597967;0.972878;0.000503;0.026617;0.999965;0.000005;0.000025;0.999845;0;0.000155;0.967218;0.002922;0;0.99919;0.998885;0 0.000529;0.99985;0.972878;0.000503;0;0.880797;0.389361;0.967218;0.002922;0.029857;0.999435;0.000005;0.00055;0.999335;0;0.000665;0.946015;0.009153;0;0.998625;0.988535;0.00001 0.000503;0.999845;0.967218;0.002922;0;0.268941;0.563899;0.946015;0.009153;0.044832;0.99919;0.000005;0.000805;0.998885;0;0.001115;0.926613;0.003178;0.00001;0.99058;0.968105;0.00004 0.002922;0.999335;0.946015;0.009153;0;0.047426;0.58686;0.926613;0.003178;0.070208;0.998625;0.00002;0.001355;0.988535;0.00001;0.01145;0.853032;0.000042;0.00004;0.381295;0.802785;0.00148 0.009153;0.998885;0.926613;0.003178;0.00001;0.880797;0.279086;0.853032;0.000042;0.146931;0.99058;0.000025;0.0094;0.968105;0.00004;0.031865;0.558403;0.000559;0.00148;0.23099;0.6761;0.000085 0.002453;0.6761;0.256732;0.000116;0.00013;0.017986;0.44695;0.407903;0.000456;0.591639;0.19305;0.00006;0.806885;0.54496;0.00067;0.45437;0.658868;0.001987;0.00152;0.9086;0.75571;0.002275 0.000116;0.42864;0.407903;0.000456;0.00067;0.047426;0.516494;0.658868;0.001987;0.339146;0.579265;0.00031;0.420425;0.581085;0.00152;0.417395;0.787854;0.002991;0.002275;0.96037;0.92668;0.00197 0.000456;0.54496;0.658868;0.001987;0.00152;0.268941;0.347057;0.787854;0.002991;0.209154;0.9086;0.00226;0.089145;0.75571;0.002275;0.24201;0.881451;0.001828;0.00197;0.986125;0.96213;0.00024 0.002991;0.75571;0.881451;0.001828;0.00197;0.017986;0.350464;0.819756;0.00041;0.179832;0.986125;0.000085;0.013785;0.96213;0.00024;0.03763;0.807821;0.000242;0.000225;0.99317;0.98375;0.00066 0.001828;0.92668;0.819756;0.00041;0.00024;0.731059;0.45289;0.807821;0.000242;0.191934;0.9916;0.000265;0.008125;0.96689;0.000225;0.032885;0.833145;0.000286;0.00066;0.99834;0.985635;0.00004 
0.00041;0.96213;0.807821;0.000242;0.000225;0.880797;0.282925;0.833145;0.000286;0.166569;0.99317;0.000075;0.006755;0.98375;0.00066;0.01559;0.933461;0.000049;0.00004;0.998295;0.99341;0.000005 0.000242;0.96689;0.833145;0.000286;0.00066;0.268941;0.233438;0.933461;0.000049;0.066492;0.99834;0.00004;0.001625;0.985635;0.00004;0.014325;0.948341;0.000038;0.000005;0.99838;0.997545;0 0.000038;0.99341;0.973207;0.000044;0;0.731059;0.676558;0.97968;0.00275;0.017572;0.998635;0.00019;0.001175;0.9994;0.000005;0.0006;0.97611;0.002973;0;0.998365;0.999885;0 0.000044;0.997545;0.97968;0.00275;0.000005;0.047426;0.498;0.97611;0.002973;0.020915;0.998615;0.000385;0.000995;0.99987;0;0.00013;0.980911;0.00034;0;0.996925;0.99964;0 0.00275;0.9994;0.97611;0.002973;0;0.5;0.634599;0.980911;0.00034;0.018749;0.998365;0.00099;0.000645;0.999885;0;0.000115;0.983061;0.000367;0;0.99072;0.997965;0 0.002973;0.99987;0.980911;0.00034;0;0.952574;0.843169;0.983061;0.000367;0.01657;0.996925;0.00107;0.001995;0.99964;0;0.00036;0.979397;0.000427;0;0.981765;0.996365;0.00001 0.00034;0.999885;0.983061;0.000367;0;0.5;0.093638;0.979397;0.000427;0.020175;0.99072;0.00125;0.008025;0.997965;0;0.002035;0.974694;0.00077;0.00001;0.94478;0.98166;0.00056 0.000367;0.99964;0.979397;0.000427;0;0.047426;0.39556;0.974694;0.00077;0.024536;0.981765;0.00227;0.015965;0.996365;0.00001;0.003625;0.907345;0.002525;0.00056;0.81602;0.90559;0.000965 0.000427;0.997965;0.974694;0.00077;0.00001;0.952574;0.42678;0.907345;0.002525;0.090128;0.94478;0.002605;0.052615;0.98166;0.00056;0.017775;0.806528;0.002708;0.000965;0.37423;0.168285;0.003255 0.00077;0.996365;0.907345;0.002525;0.00056;0.952574;0.149822;0.806528;0.002708;0.190764;0.81602;0.00275;0.18123;0.90559;0.000965;0.093445;0.25032;0.005192;0.003255;0.194985;0.056635;0.00331 0.509818;0.000275;0.05059;0.37672;0.54454;0.731059;0.282114;0.951709;0.024367;0.023924;0.980945;0.00005;0.019005;0.997815;0.00003;0.002155;0.95633;0.001666;0.000005;0.995795;0.999305;0 
0.37672;0.00048;0.951709;0.024367;0.00003;0.047426;0.459837;0.95633;0.001666;0.042002;0.995415;0;0.004585;0.999065;0.000005;0.000925;0.976598;0.001666;0;0.99991;0.999895;0 0.024367;0.997815;0.95633;0.001666;0.000005;0.002473;0.037652;0.976598;0.001666;0.021734;0.995795;0.000005;0.004195;0.999305;0;0.000695;0.982954;0.003358;0;0.999945;0.999985;0 0.001666;0.999065;0.976598;0.001666;0;0.268941;0.284144;0.982954;0.003358;0.013688;0.99991;0.00002;0.00007;0.999895;0;0.000105;0.983102;0.002203;0;0.999865;0.999905;0 0.001666;0.999305;0.982954;0.003358;0;0.5;0.378246;0.983102;0.002203;0.01469;0.999945;0.00002;0.00003;0.999985;0;0.000005;0.975464;0.002224;0;0.999865;0.999855;0 0.003358;0.999895;0.983102;0.002203;0;0.268941;0.476268;0.975464;0.002224;0.02231;0.999865;0.00002;0.000115;0.999905;0;0.00009;0.969589;0.000018;0;0.998255;0.998425;0.00001 0.002203;0.999985;0.975464;0.002224;0;0.880797;0.693174;0.969589;0.000018;0.030394;0.999865;0.000025;0.000115;0.999855;0;0.000145;0.94434;0.002803;0.00001;0.99444;0.993315;0.000145 0.002224;0.999905;0.969589;0.000018;0;0.5;0.529217;0.94434;0.002803;0.052856;0.998255;0.000025;0.001715;0.998425;0.00001;0.001565;0.933727;0.006593;0.000145;0.99246;0.961665;0.0002 0.000018;0.999855;0.94434;0.002803;0.00001;0.047426;0.41096;0.933727;0.006593;0.059682;0.99444;0.000025;0.005535;0.993315;0.000145;0.006545;0.911364;0.002948;0.0002;0.495955;0.625965;0.00081 0.002803;0.998425;0.933727;0.006593;0.000145;0.880797;0.443246;0.911364;0.002948;0.085688;0.99246;0.00002;0.00752;0.961665;0.0002;0.038135;0.53995;0.000964;0.00081;0.000715;0.021055;0.000915 0.000021;0.977485;0.955565;0.000027;0.00002;0.731059;0.643136;0.981244;0.000015;0.018738;0.996045;0.00001;0.00394;0.998985;0;0.00101;0.980455;0.000014;0;0.997085;0.999905;0 0.000027;0.98628;0.981244;0.000015;0;0.268941;0.287614;0.980455;0.000014;0.019531;0.997015;0.00001;0.002975;0.999835;0;0.000165;0.977148;0.000026;0;0.9991;0.99988;0 
0.000015;0.998985;0.980455;0.000014;0;0.268941;0.61822;0.977148;0.000026;0.022826;0.997085;0.000045;0.002875;0.999905;0;0.00009;0.985975;0.000036;0;0.998265;0.99845;0 0.000014;0.999835;0.977148;0.000026;0;0.880797;0.532952;0.985975;0.000036;0.013989;0.9991;0.000075;0.000825;0.99988;0;0.00012;0.979153;0.000048;0;0.99201;0.995045;0 0.000026;0.999905;0.985975;0.000036;0;0.119203;0.195761;0.979153;0.000048;0.020799;0.998265;0.00011;0.001625;0.99845;0;0.00155;0.965073;0.003846;0;0.962435;0.974465;0.000075 0.000036;0.99988;0.979153;0.000048;0;0.047426;0.537927;0.965073;0.003846;0.031079;0.99201;0.000105;0.00788;0.995045;0;0.004955;0.928092;0.000133;0.000075;0.90279;0.723185;0.01667 0.000048;0.99845;0.965073;0.003846;0;0.268941;0.536435;0.928092;0.000133;0.071776;0.962435;0.00029;0.03727;0.974465;0.000075;0.025465;0.789833;0.006586;0.01667;0.37088;0.27723;0.001685 0.003846;0.995045;0.928092;0.000133;0.000075;0.880797;0.675902;0.789833;0.006586;0.203578;0.90279;0.003055;0.09415;0.723185;0.01667;0.26014;0.251092;0.006111;0.001685;0.16336;0.06408;0.00334 0.655616;0.0014;0.028594;0.4833;0.96643;0.731059;0.123467;0.975345;0.001873;0.02278;0.989215;0.00001;0.010775;0.999055;0.00001;0.00093;0.974067;0.00476;0;0.99997;0.999835;0 0.4833;0.001525;0.975345;0.001873;0.00001;0.006693;0.393649;0.974067;0.00476;0.021172;0.999815;0;0.000185;0.999715;0;0.000285;0.976068;0.007189;0;0.999995;0.999965;0 0.001873;0.999055;0.974067;0.00476;0;0.006693;0.255213;0.976068;0.007189;0.016743;0.99997;0;0.00003;0.999835;0;0.000165;0.973135;0.001278;0;0.99999;0.999985;0 0.00476;0.999715;0.976068;0.007189;0;0.5;0.163693;0.973135;0.001278;0.025587;0.999995;0;0.000005;0.999965;0;0.000035;0.97794;0.002564;0;0.99996;0.99998;0 0.007189;0.999835;0.973135;0.001278;0;0.982014;0.511248;0.97794;0.002564;0.019495;0.99999;0;0.000005;0.999985;0;0.000015;0.978694;0.000011;0;0.999955;0.999965;0 
0.001278;0.999965;0.97794;0.002564;0;0.268941;0.742691;0.978694;0.000011;0.021295;0.99996;0;0.00004;0.99998;0;0.00002;0.97844;0.000012;0;0.99834;0.99947;0 0.002564;0.999985;0.978694;0.000011;0;0.047426;0.529964;0.97844;0.000012;0.021546;0.999955;0;0.00004;0.999965;0;0.000035;0.949995;0.00235;0;0.9983;0.999175;0 0.000011;0.99998;0.97844;0.000012;0;0.880797;0.321475;0.949995;0.00235;0.047653;0.99834;0.000015;0.00164;0.99947;0;0.00053;0.937838;0.008506;0;0.99788;0.964215;0.000005 0.000012;0.999965;0.949995;0.00235;0;0.731059;0.579812;0.937838;0.008506;0.053654;0.9983;0.000015;0.00168;0.999175;0;0.000825;0.917462;0.003603;0.000005;0.995875;0.941665;0.00006 0.00235;0.99947;0.937838;0.008506;0;0.047426;0.528968;0.917462;0.003603;0.078933;0.99788;0.000095;0.00202;0.964215;0.000005;0.03578;0.88201;0.00159;0.00006;0.78624;0.905205;0.000245 From markus.konrad at wzb.eu Thu Sep 14 10:10:52 2017 From: markus.konrad at wzb.eu (Markus Konrad) Date: Thu, 14 Sep 2017 16:10:52 +0200 Subject: [scikit-learn] LatentDirichletAllocation failing to find topics in NLTK Gutenberg corpus? Message-ID: Hi there, I'm trying out sklearn's latent Dirichlet allocation implementation for topic modeling. The code from the official example [1] works just fine and the extracted topics look reasonable. However, when I try other corpora, for example the Gutenberg corpus from NLTK, most of the extracted topics are garbage. 
See this example output when trying to get 30 topics:

Topic #0: zoological (0.01) fathoms (0.01) fatigued (0.01) fatigues (0.01) fatiguing (0.01)
Topic #1: mr (1081.61) emma (866.01) miss (506.94) mrs (445.56) jane (301.83)
Topic #2: zoological (0.01) fathoms (0.01) fatigued (0.01) fatigues (0.01) fatiguing (0.01)
Topic #3: thee (82.64) thou (70.0) thy (66.66) father (56.45) mother (55.27)
Topic #4: anne (498.74) captain (303.01) lady (173.96) mr (172.07) charles (166.21)
Topic #5: zoological (0.01) fathoms (0.01) fatigued (0.01) fatigues (0.01) fatiguing (0.01)
Topic #6: zoological (0.01) fathoms (0.01) fatigued (0.01) fatigues (0.01) fatiguing (0.01)
Topic #7: zoological (0.01) fathoms (0.01) fatigued (0.01) fatigues (0.01) fatiguing (0.01)
...

Many topics tend to have the same weights, all equal to the
`topic_word_prior` parameter.

This is my script:

import nltk
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

def print_top_words(model, feature_names, n_top_words):
    for topic_idx, topic in enumerate(model.components_):
        message = "Topic #%d: " % topic_idx
        message += " ".join([feature_names[i] + " (" + str(round(topic[i], 2)) + ")"
                             for i in topic.argsort()[:-n_top_words - 1:-1]])
        print(message)

data_samples = [nltk.corpus.gutenberg.raw(f_id)
                for f_id in nltk.corpus.gutenberg.fileids()]

tf_vectorizer = CountVectorizer(max_df=0.95, min_df=2, stop_words='english')
tf = tf_vectorizer.fit_transform(data_samples)

lda = LatentDirichletAllocation(n_components=30,
                                learning_method='batch',
                                n_jobs=-1,  # all CPUs
                                verbose=1,
                                evaluate_every=10,
                                max_iter=1000,
                                doc_topic_prior=0.1,
                                topic_word_prior=0.01,
                                random_state=1)
lda.fit(tf)

tf_feature_names = tf_vectorizer.get_feature_names()
print_top_words(lda, tf_feature_names, 5)

Is there a problem in how I set up the LatentDirichletAllocation instance
or pass the data?
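[Editor's note: the degenerate topics above can be spotted programmatically. In a fitted model, `components_` holds `topic_word_prior` plus the expected word counts per topic, so a topic that attracted essentially no words stays close to the prior. A minimal self-contained sketch on synthetic counts — the matrix `X`, the tolerance `0.05`, and the settings here are illustrative, not taken from the script above:]

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

# Tiny synthetic document-term matrix: 6 docs over 4 terms,
# with two obvious "themes" (terms 0-1 vs. terms 2-3).
X = np.array([[5, 4, 0, 0],
              [6, 3, 0, 0],
              [4, 5, 1, 0],
              [0, 0, 5, 6],
              [0, 1, 4, 5],
              [0, 0, 6, 4]])

lda = LatentDirichletAllocation(n_components=5, topic_word_prior=0.01,
                                learning_method='batch', max_iter=100,
                                random_state=0).fit(X)

# A topic that received (almost) no expected counts keeps every entry of
# components_ near topic_word_prior; flag those rows as degenerate.
degenerate = (lda.components_ < lda.topic_word_prior + 0.05).all(axis=1)
print("%d of %d topics are degenerate" % (degenerate.sum(), lda.n_components))
```

On the Gutenberg run, the same mask would flag the all-`(0.01)` topics shown in the output above.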
I tried out different parameter settings, but none of them produced good
results for that corpus. I also tried alternative implementations (like the
lda package [2]), and those were able to find reasonable topics.

Best,
Markus

[1] http://scikit-learn.org/stable/auto_examples/applications/plot_topics_extraction_with_nmf_lda.html#sphx-glr-auto-examples-applications-plot-topics-extraction-with-nmf-lda-py
[2] http://pythonhosted.org/lda/

From shane.grigsby at colorado.edu Thu Sep 14 12:32:28 2017
From: shane.grigsby at colorado.edu (Shane Grigsby)
Date: Thu, 14 Sep 2017 10:32:28 -0600
Subject: [scikit-learn] Accessing Clustering Feature Tree in Birch
In-Reply-To: 
References: 
Message-ID: <20170914163228.lqd6hvto363m5ybu@espgs-MacBook-Pro.local>

I'd be interested in hearing the answer to this as well, specifically
whether there's a standardized way in the API to deal with nested
hierarchical clusters (i.e., when 'b' and 'c' are child clusters entirely
contained within parent cluster 'a'). Perhaps there's a way to identify
multiple clusters in the labels array (using decimals or imaginary numbers
to refer to parents or children, or maybe allowing the labels array to be
two-dimensional)? Or a separate attribute array that encodes the hierarchy?
Or maybe it makes more sense to rely on a separate function to move up or
down a level in the hierarchy, rather than on an attribute of the fitted
clustering estimator?

~Shane

On 09/14, Sema Atasever wrote:
>Dear scikit-learn members,
>
>I have written about this subject before, but my question is not yet
>fully resolved.
>
>- How can I *access the Clustering Feature Tree* in Birch?
>
>- For example, how many clusters are there in the hierarchy under the *root
>node*, and which data samples belong to each of these clusters?
>
>- Can I get them separately for 3 trees?
>
>Best.
>
>*Birch Implementation Code:*
>
>from sklearn.cluster import Birch
>import numpy as np
>import matplotlib.pyplot as plt
>
># Raw string so the backslash in the Windows path is not read as an escape.
>X = np.loadtxt(open(r"C:\dataset.txt", "rb"), delimiter=";")
>
>brc = Birch(branching_factor=50, n_clusters=None, threshold=0.5,
>            compute_labels=True, copy=True)
>brc.fit(X)
>
>birch_predict = brc.predict(X)
>print("\nClustering_result:\n")
>print(birch_predict)
>
>np.savetxt('birch_predict_CLASS_0.csv', birch_predict, fmt="%i",
>           delimiter=',')
>
>myroot = brc.root_
>
>centroids = brc.subcluster_centers_
>plt.plot(X[:, 0], X[:, 1], '+')
>plt.plot(centroids[:, 0], centroids[:, 1], 'o')
>plt.show()
>
>labels = brc.subcluster_labels_
>n_clusters = np.unique(labels).size
>print("n_clusters : %d\n" % n_clusters)
>0.109543;0.919285;0.734065;0.097927;0.01736;0.268941;0.080394;0.347599;0.093625;0.558776;0.45373;0.275445;0.270825;0.409985;0.00429;0.585725;0.162941;0.128565;0.003945;0.304175;0.00075;0.19228 >0.015575;0.128615;0.18182;0.010233;0.006035;0.006693;0.628083;0.509804;0.002916;0.48728;0.564515;0.000295;0.43519;0.82232;0.000115;0.177565;0.836858;0.003037;0.00001;0.843905;0.956825;0.000005 >0.010233;0.131695;0.509804;0.002916;0.000115;0.119203;0.665299;0.836858;0.003037;0.160105;0.642435;0.00077;0.356795;0.89494;0.00001;0.10505;0.924991;0.001083;0.000005;0.851295;0.999195;0 >0.002916;0.82232;0.836858;0.003037;0.00001;0.993307;0.4985;0.924991;0.001083;0.073921;0.843905;0.00277;0.15332;0.956825;0.000005;0.04316;0.941577;0.000813;0;0.919575;0.999835;0 >0.003037;0.89494;0.924991;0.001083;0.000005;0.119203;0.606828;0.941577;0.000813;0.057608;0.851295;0.001965;0.14674;0.999195;0;0.0008;0.964536;0.000653;0;0.952645;0.99983;0 >0.001083;0.956825;0.941577;0.000813;0;0.017986;0.601088;0.964536;0.000653;0.034808;0.919575;0.001485;0.07893;0.999835;0;0.000165;0.975557;0.001343;0;0.95212;0.999935;0 >0.000813;0.999195;0.964536;0.000653;0;0.047426;0.514996;0.975557;0.001343;0.023099;0.952645;0.003555;0.0438;0.99983;0;0.00017;0.977389;0.003606;0;0.94496;0.999805;0 >0.000653;0.999835;0.975557;0.001343;0;0.880797;0.218744;0.977389;0.003606;0.019006;0.95212;0.010345;0.03754;0.999935;0;0.000065;0.974858;0.006419;0;0.937955;0.99909;0 >0.001343;0.99983;0.977389;0.003606;0;0.047426;0.347964;0.974858;0.006419;0.018723;0.94496;0.01178;0.04326;0.999805;0;0.000195;0.973184;0.004053;0;0.90462;0.997515;0 >0.003606;0.999935;0.974858;0.006419;0;0.047426;0.49725;0.973184;0.004053;0.022763;0.937955;0.011675;0.05037;0.99909;0;0.00091;0.960937;0.003776;0;0.799285;0.986475;0.000035 >0.006419;0.999805;0.973184;0.004053;0;0.731059;0.426046;0.960937;0.003776;0.035285;0.90462;0.010795;0.08458;0.997515;0;0.002485;0.928221;0.003408;0.000035;0.72453;0.894125;0.00006 
>0.004053;0.99909;0.960937;0.003776;0;0.047426;0.725916;0.928221;0.003408;0.068373;0.799285;0.00964;0.19108;0.986475;0.000035;0.01349;0.842942;0.003839;0.00006;0.698595;0.739345;0.00018 >0.003776;0.997515;0.928221;0.003408;0.000035;0.047426;0.299643;0.842942;0.003839;0.153217;0.72453;0.010775;0.264695;0.894125;0.00006;0.10581;0.650625;0.012722;0.00018;0.663635;0.65696;0.008015 >0.003408;0.986475;0.842942;0.003839;0.00006;0.5;0.536187;0.650625;0.012722;0.336655;0.698595;0.03711;0.2643;0.739345;0.00018;0.260475;0.505169;0.043735;0.008015;0.630895;0.39294;0.00277 >0.072568;0.06769;0.021446;0.038434;0.07939;0.047426;0;0.348958;0.036144;0.614896;0.077135;0.00615;0.91671;0.021545;0.07638;0.902075;0.348864;0.030603;0.034935;0.11489;0.024195;0.162845 >0.038434;0.020485;0.348958;0.036144;0.07638;0.119203;0.803924;0.348864;0.030603;0.620533;0.085065;0.027595;0.88734;0.020085;0.034935;0.94498;0.357251;0.112677;0.162845;0.11449;0.02295;0.32538 >0.036144;0.021545;0.348864;0.030603;0.034935;0.119203;0;0.357251;0.112677;0.530072;0.11489;0.14152;0.743585;0.024195;0.162845;0.812965;0.349695;0.205096;0.32538;0.095495;0.01456;0.18781 >0.402061;0.154955;0.200411;0.383701;0.146995;0.047426;0.335815;0.644193;0.058533;0.297273;0.413315;0.108035;0.478645;0.64462;0.004885;0.350495;0.647131;0.056199;0.005445;0.40603;0.673135;0.008125 >0.383701;0.16909;0.644193;0.058533;0.004885;0.047426;0.165481;0.647131;0.056199;0.296669;0.416905;0.100475;0.48262;0.649845;0.005445;0.34471;0.65127;0.039951;0.008125;0.39438;0.667785;0.00753 >0.058533;0.64462;0.647131;0.056199;0.005445;0.119203;0;0.65127;0.039951;0.308776;0.40603;0.04905;0.54492;0.673135;0.008125;0.31873;0.64146;0.035531;0.00753;0.376435;0.666245;0.00642 >0.056199;0.649845;0.65127;0.039951;0.008125;0.731059;0.635758;0.64146;0.035531;0.323009;0.39438;0.03017;0.57545;0.667785;0.00753;0.324685;0.373052;0.032856;0.00642;0.34128;0.66615;0.030435 
>0;0;0.150847;0.094682;0.099565;0.006693;0.414353;0.849131;0.000118;0.150753;0.880315;0.000095;0.11959;0.882355;0.00012;0.11753;0.922273;0.000065;0.00003;0.992165;0.976785;0.00005 >0.094682;0.02561;0.849131;0.000118;0.00012;0.731059;0.388648;0.922273;0.000065;0.077658;0.98756;0.00006;0.012375;0.956645;0.00003;0.04332;0.924201;0.00007;0.00005;0.99962;0.99817;0 >0.000118;0.882355;0.922273;0.000065;0.00003;0.119203;0.513747;0.924201;0.00007;0.075727;0.992165;0.00007;0.007765;0.976785;0.00005;0.02316;0.957982;0.000043;0;0.99969;0.99949;0 >0.000065;0.956645;0.924201;0.00007;0.00005;0.268941;0.348418;0.957982;0.000043;0.041973;0.99962;0.000055;0.00032;0.99817;0;0.00183;0.974686;0.000038;0;0.99974;0.999935;0 >0.00007;0.976785;0.957982;0.000043;0;0.006693;0.649991;0.974686;0.000038;0.025273;0.99969;0.00005;0.00025;0.99949;0;0.00051;0.98024;0.000037;0;0.9996;0.99978;0 >0.000043;0.99817;0.974686;0.000038;0;0.000911;0.53096;0.98024;0.000037;0.019725;0.99974;0.000055;0.00021;0.999935;0;0.000065;0.983794;0.000048;0;0.9994;0.99998;0 >0.000038;0.99949;0.98024;0.000037;0;0.5;0.349327;0.983794;0.000048;0.016156;0.9996;0.000095;0.0003;0.99978;0;0.00022;0.990775;0.000049;0;0.99923;0.999985;0 >0.000037;0.999935;0.983794;0.000048;0;0.047426;0.27768;0.990775;0.000049;0.009177;0.9994;0.0001;0.000505;0.99998;0;0.00002;0.99067;0.000062;0;0.99604;0.9999;0 >0.000048;0.99978;0.990775;0.000049;0;0.119203;0.380365;0.99067;0.000062;0.009267;0.99923;0.00014;0.00063;0.999985;0;0.000015;0.986525;0.00013;0;0.986525;0.99898;0 >0.000049;0.99998;0.99067;0.000062;0;0.880797;0.668853;0.986525;0.00013;0.013345;0.99604;0.000345;0.003615;0.9999;0;0.0001;0.983631;0.000132;0;0.96961;0.99821;0 >0.000062;0.999985;0.986525;0.00013;0;0.002473;0.100833;0.983631;0.000132;0.016239;0.986525;0.00035;0.01313;0.99898;0;0.00102;0.971793;0.000128;0;0.950005;0.994645;0.000005 
>0.00013;0.9999;0.983631;0.000132;0;0.017986;0.557988;0.971793;0.000128;0.028079;0.96961;0.00034;0.03005;0.99821;0;0.00179;0.939341;0.001325;0.000005;0.7612;0.908295;0.000745 >0.000132;0.99898;0.971793;0.000128;0;0.5;0.391741;0.939341;0.001325;0.059334;0.950005;0.00187;0.048125;0.994645;0.000005;0.00535;0.832202;0.010909;0.000745;0.20003;0.294375;0.000355 >0.000128;0.99821;0.939341;0.001325;0.000005;0.268941;0.276678;0.832202;0.010909;0.156889;0.7612;0.00267;0.23613;0.908295;0.000745;0.09096;0.24415;0.012425;0.000355;0.12832;0.04852;0.00063 >0.700558;0.00128;0.033528;0.51778;0.69314;0.119203;0.467048;0.957029;0.018677;0.024294;0.988125;0.00001;0.01186;0.9871;0.000125;0.01278;0.961403;0.011465;0;0.999395;0.99769;0.000005 >0.51778;0.001365;0.957029;0.018677;0.000125;0.047426;0.316695;0.961403;0.011465;0.027132;0.99916;0.000005;0.000835;0.997235;0;0.002765;0.966611;0.011268;0.000005;0.999775;0.99896;0 >0.018677;0.9871;0.961403;0.011465;0;0.017986;0.108515;0.966611;0.011268;0.022121;0.999395;0;0.000605;0.99769;0.000005;0.002305;0.97066;0.011267;0;0.99999;0.99999;0 >0.011465;0.997235;0.966611;0.011268;0.000005;0.5;0.099302;0.97066;0.011267;0.018073;0.999775;0;0.000225;0.99896;0;0.00104;0.967128;0.010435;0;0.999975;0.999955;0 >0.011268;0.99769;0.97066;0.011267;0;0.268941;0.567093;0.967128;0.010435;0.022437;0.99999;0;0.00001;0.99999;0;0.00001;0.96963;0.011402;0;0.99997;0.99994;0 >0.011267;0.99896;0.967128;0.010435;0;0.119203;0.653169;0.96963;0.011402;0.018968;0.999975;0;0.000025;0.999955;0;0.000045;0.968817;0.008672;0;0.9999;0.999595;0 >0.010435;0.99999;0.96963;0.011402;0;0.268941;0.797057;0.968817;0.008672;0.022511;0.99997;0;0.00003;0.99994;0;0.00006;0.952639;0.000037;0;0.99978;0.999795;0 >0.011402;0.999955;0.968817;0.008672;0;0.731059;0.563653;0.952639;0.000037;0.047324;0.9999;0;0.0001;0.999595;0;0.000405;0.941544;0.000053;0;0.999345;0.99894;0 
>0.008672;0.99994;0.952639;0.000037;0;0.047426;0.613963;0.941544;0.000053;0.058401;0.99978;0;0.000215;0.999795;0;0.000205;0.953784;0.033575;0;0.99645;0.99539;0.00001 >0.000037;0.999595;0.941544;0.000053;0;0.119203;0.529715;0.953784;0.033575;0.012639;0.999345;0.00004;0.00061;0.99894;0;0.00106;0.950305;0.037169;0.00001;0.992425;0.988895;0.00003 >0.000053;0.999795;0.953784;0.033575;0;0.731059;0.488502;0.950305;0.037169;0.012526;0.99645;0.00008;0.003465;0.99539;0.00001;0.004605;0.953128;0.015558;0.00003;0.96765;0.981705;0.00012 >0.033575;0.99894;0.950305;0.037169;0.00001;0.5;0.426536;0.953128;0.015558;0.031314;0.992425;0.000085;0.00749;0.988895;0.00003;0.011075;0.933835;0.016748;0.00012;0.82843;0.899765;0.000305 >0.037169;0.99539;0.953128;0.015558;0.00003;0.119203;0.554285;0.933835;0.016748;0.049415;0.96765;0.00009;0.032255;0.981705;0.00012;0.018175;0.835794;0.015184;0.000305;0.151705;0.462245;0.000715 >0.015558;0.988895;0.933835;0.016748;0.00012;0.5;0.60444;0.835794;0.015184;0.149021;0.82843;0.000105;0.171465;0.899765;0.000305;0.099925;0.312008;0.008636;0.000715;0.07423;0.025605;0.00077 >0.016792;0.083015;0.117144;0.078285;0.04578;0.119203;0.546614;0.248611;0.003692;0.7477;0.21343;0.00002;0.786555;0.33779;0.00833;0.653885;0.393686;0.005201;0.00746;0.550185;0.461935;0.003405 >0.078285;0.12086;0.248611;0.003692;0.00833;0.119203;0.455617;0.393686;0.005201;0.601113;0.38517;0;0.61483;0.4509;0.00746;0.54164;0.432389;0.001162;0.003405;0.888935;0.87477;0.00035 >0.003692;0.33779;0.393686;0.005201;0.00746;0.119203;0.231831;0.432389;0.001162;0.566446;0.550185;0.00001;0.4498;0.461935;0.003405;0.534655;0.841113;0.000142;0.00035;0.97981;0.92331;0.00007 >0.005201;0.4509;0.432389;0.001162;0.003405;0.5;0.280497;0.841113;0.000142;0.158745;0.888935;0.00001;0.111055;0.87477;0.00035;0.12488;0.900356;0.000046;0.00007;0.990275;0.97047;0.00004 
>0.001162;0.461935;0.841113;0.000142;0.00035;0.119203;0.523732;0.900356;0.000046;0.099599;0.97981;0.000005;0.020185;0.92331;0.00007;0.076625;0.968005;0.001144;0.00004;0.99949;0.99538;0.000115 >0.000142;0.87477;0.900356;0.000046;0.00007;0.268941;0.439301;0.968005;0.001144;0.030853;0.990275;0.000005;0.009725;0.97047;0.00004;0.02949;0.993907;0.001241;0.000115;0.99965;0.996815;0.00004 >0.000046;0.92331;0.968005;0.001144;0.00004;0.880797;0.350692;0.993907;0.001241;0.004852;0.99949;0;0.00051;0.99538;0.000115;0.004505;0.988958;0.001281;0.00004;0.99971;0.995905;0.000015 >0.001144;0.97047;0.993907;0.001241;0.000115;0.5;0.329599;0.988958;0.001281;0.009761;0.99965;0;0.00035;0.996815;0.00004;0.003145;0.985293;0.000038;0.000015;0.999985;0.99454;0.00001 >0.001241;0.99538;0.988958;0.001281;0.00004;0.119203;0.452642;0.985293;0.000038;0.014667;0.99971;0;0.000285;0.995905;0.000015;0.00408;0.979348;0.005637;0.00001;0.99995;0.995545;0.00001 >0.001281;0.996815;0.985293;0.000038;0.000015;0.5;0.606828;0.979348;0.005637;0.015015;0.999985;0;0.000015;0.99454;0.00001;0.00545;0.994323;0.003886;0.00001;0.99995;0.99819;0 >0.000038;0.995905;0.979348;0.005637;0.00001;0.5;0.719302;0.994323;0.003886;0.001791;0.99995;0;0.00005;0.995545;0.00001;0.004445;0.993189;0.000233;0;0.999355;0.999255;0 >0.005637;0.99454;0.994323;0.003886;0.00001;0.268941;0.179167;0.993189;0.000233;0.006578;0.99995;0;0.00005;0.99819;0;0.00181;0.999115;0.000211;0;0.99927;0.99949;0 >0.003886;0.995545;0.993189;0.000233;0;0.119203;0.67787;0.999115;0.000211;0.000674;0.999355;0;0.000645;0.999255;0;0.000745;0.998113;0.000171;0;0.99831;0.99966;0 >0.000233;0.99819;0.999115;0.000211;0;0.731059;0;0.998113;0.000171;0.001716;0.99927;0.000005;0.000725;0.99949;0;0.00051;0.998331;0.000116;0;0.991755;0.99985;0 >0.000211;0.999255;0.998113;0.000171;0;0.047426;0.286182;0.998331;0.000116;0.001553;0.99831;0.000005;0.001685;0.99966;0;0.00034;0.996304;0.000106;0;0.9818;0.99917;0 
>0.000171;0.99949;0.998331;0.000116;0;0.268941;0.441519;0.996304;0.000106;0.003589;0.991755;0.00001;0.008235;0.99985;0;0.00015;0.984398;0.000236;0;0.8794;0.99648;0.000005 >0.000116;0.99966;0.996304;0.000106;0;0.119203;0.457354;0.984398;0.000236;0.015367;0.9818;0.000575;0.01763;0.99917;0;0.00083;0.94823;0.00062;0.000005;0.83004;0.981515;0.000025 >0.000236;0.99917;0.94823;0.00062;0.000005;0.5;0.315831;0.916663;0.000691;0.082648;0.83004;0.002015;0.167945;0.981515;0.000025;0.018465;0.849844;0.000841;0.000045;0.60209;0.93436;0.009335 >0.00062;0.99648;0.916663;0.000691;0.000025;0.119203;0.512497;0.849844;0.000841;0.149313;0.69651;0.002445;0.301045;0.943615;0.000045;0.056335;0.784729;0.00706;0.009335;0.550905;0.788395;0.01519 >0.000691;0.981515;0.849844;0.000841;0.000045;0.119203;0.607305;0.784729;0.00706;0.208215;0.60209;0.01181;0.386105;0.93436;0.009335;0.05631;0.697286;0.014475;0.01519;0.00835;0.10455;0.00141 >0.672462;0.005635;0.048684;0.372434;0.79551;0.5;0.023823;0.941118;0.006192;0.052691;0.9675;0.000015;0.032485;0.983405;0.00004;0.01656;0.955736;0.008144;0.000005;0.999815;0.999065;0 >0.372434;0.006595;0.941118;0.006192;0.00004;0.047426;0.444974;0.955736;0.008144;0.036119;0.99976;0;0.000235;0.99842;0.000005;0.001575;0.976851;0.000012;0;0.99996;0.9999;0 >0.006192;0.983405;0.955736;0.008144;0.000005;0.017986;0.199248;0.976851;0.000012;0.023137;0.999815;0.000005;0.00018;0.999065;0;0.000935;0.983451;0.000012;0;0.999985;0.99999;0 >0.008144;0.99842;0.976851;0.000012;0;0.119203;0.057487;0.983451;0.000012;0.016538;0.99996;0.000005;0.000035;0.9999;0;0.0001;0.982751;0.000012;0;0.999985;0.99985;0 >0.000012;0.999065;0.983451;0.000012;0;0.952574;0.597967;0.982751;0.000012;0.017236;0.999985;0.000005;0.000005;0.99999;0;0.00001;0.975726;0.000529;0;0.999965;0.999845;0 >0.000012;0.9999;0.982751;0.000012;0;0.047426;0.574443;0.975726;0.000529;0.023745;0.999985;0.000005;0.00001;0.99985;0;0.00015;0.972878;0.000503;0;0.999435;0.999335;0 
>0.000012;0.99999;0.975726;0.000529;0;0.5;0.597967;0.972878;0.000503;0.026617;0.999965;0.000005;0.000025;0.999845;0;0.000155;0.967218;0.002922;0;0.99919;0.998885;0 >0.000529;0.99985;0.972878;0.000503;0;0.880797;0.389361;0.967218;0.002922;0.029857;0.999435;0.000005;0.00055;0.999335;0;0.000665;0.946015;0.009153;0;0.998625;0.988535;0.00001 >0.000503;0.999845;0.967218;0.002922;0;0.268941;0.563899;0.946015;0.009153;0.044832;0.99919;0.000005;0.000805;0.998885;0;0.001115;0.926613;0.003178;0.00001;0.99058;0.968105;0.00004 >0.002922;0.999335;0.946015;0.009153;0;0.047426;0.58686;0.926613;0.003178;0.070208;0.998625;0.00002;0.001355;0.988535;0.00001;0.01145;0.853032;0.000042;0.00004;0.381295;0.802785;0.00148 >0.009153;0.998885;0.926613;0.003178;0.00001;0.880797;0.279086;0.853032;0.000042;0.146931;0.99058;0.000025;0.0094;0.968105;0.00004;0.031865;0.558403;0.000559;0.00148;0.23099;0.6761;0.000085 >0.002453;0.6761;0.256732;0.000116;0.00013;0.017986;0.44695;0.407903;0.000456;0.591639;0.19305;0.00006;0.806885;0.54496;0.00067;0.45437;0.658868;0.001987;0.00152;0.9086;0.75571;0.002275 >0.000116;0.42864;0.407903;0.000456;0.00067;0.047426;0.516494;0.658868;0.001987;0.339146;0.579265;0.00031;0.420425;0.581085;0.00152;0.417395;0.787854;0.002991;0.002275;0.96037;0.92668;0.00197 >0.000456;0.54496;0.658868;0.001987;0.00152;0.268941;0.347057;0.787854;0.002991;0.209154;0.9086;0.00226;0.089145;0.75571;0.002275;0.24201;0.881451;0.001828;0.00197;0.986125;0.96213;0.00024 >0.002991;0.75571;0.881451;0.001828;0.00197;0.017986;0.350464;0.819756;0.00041;0.179832;0.986125;0.000085;0.013785;0.96213;0.00024;0.03763;0.807821;0.000242;0.000225;0.99317;0.98375;0.00066 >0.001828;0.92668;0.819756;0.00041;0.00024;0.731059;0.45289;0.807821;0.000242;0.191934;0.9916;0.000265;0.008125;0.96689;0.000225;0.032885;0.833145;0.000286;0.00066;0.99834;0.985635;0.00004 
>0.00041;0.96213;0.807821;0.000242;0.000225;0.880797;0.282925;0.833145;0.000286;0.166569;0.99317;0.000075;0.006755;0.98375;0.00066;0.01559;0.933461;0.000049;0.00004;0.998295;0.99341;0.000005 >0.000242;0.96689;0.833145;0.000286;0.00066;0.268941;0.233438;0.933461;0.000049;0.066492;0.99834;0.00004;0.001625;0.985635;0.00004;0.014325;0.948341;0.000038;0.000005;0.99838;0.997545;0 >0.000038;0.99341;0.973207;0.000044;0;0.731059;0.676558;0.97968;0.00275;0.017572;0.998635;0.00019;0.001175;0.9994;0.000005;0.0006;0.97611;0.002973;0;0.998365;0.999885;0 >0.000044;0.997545;0.97968;0.00275;0.000005;0.047426;0.498;0.97611;0.002973;0.020915;0.998615;0.000385;0.000995;0.99987;0;0.00013;0.980911;0.00034;0;0.996925;0.99964;0 >0.00275;0.9994;0.97611;0.002973;0;0.5;0.634599;0.980911;0.00034;0.018749;0.998365;0.00099;0.000645;0.999885;0;0.000115;0.983061;0.000367;0;0.99072;0.997965;0 >0.002973;0.99987;0.980911;0.00034;0;0.952574;0.843169;0.983061;0.000367;0.01657;0.996925;0.00107;0.001995;0.99964;0;0.00036;0.979397;0.000427;0;0.981765;0.996365;0.00001 >0.00034;0.999885;0.983061;0.000367;0;0.5;0.093638;0.979397;0.000427;0.020175;0.99072;0.00125;0.008025;0.997965;0;0.002035;0.974694;0.00077;0.00001;0.94478;0.98166;0.00056 >0.000367;0.99964;0.979397;0.000427;0;0.047426;0.39556;0.974694;0.00077;0.024536;0.981765;0.00227;0.015965;0.996365;0.00001;0.003625;0.907345;0.002525;0.00056;0.81602;0.90559;0.000965 >0.000427;0.997965;0.974694;0.00077;0.00001;0.952574;0.42678;0.907345;0.002525;0.090128;0.94478;0.002605;0.052615;0.98166;0.00056;0.017775;0.806528;0.002708;0.000965;0.37423;0.168285;0.003255 >0.00077;0.996365;0.907345;0.002525;0.00056;0.952574;0.149822;0.806528;0.002708;0.190764;0.81602;0.00275;0.18123;0.90559;0.000965;0.093445;0.25032;0.005192;0.003255;0.194985;0.056635;0.00331 >0.509818;0.000275;0.05059;0.37672;0.54454;0.731059;0.282114;0.951709;0.024367;0.023924;0.980945;0.00005;0.019005;0.997815;0.00003;0.002155;0.95633;0.001666;0.000005;0.995795;0.999305;0 
>0.37672;0.00048;0.951709;0.024367;0.00003;0.047426;0.459837;0.95633;0.001666;0.042002;0.995415;0;0.004585;0.999065;0.000005;0.000925;0.976598;0.001666;0;0.99991;0.999895;0 >0.024367;0.997815;0.95633;0.001666;0.000005;0.002473;0.037652;0.976598;0.001666;0.021734;0.995795;0.000005;0.004195;0.999305;0;0.000695;0.982954;0.003358;0;0.999945;0.999985;0 >0.001666;0.999065;0.976598;0.001666;0;0.268941;0.284144;0.982954;0.003358;0.013688;0.99991;0.00002;0.00007;0.999895;0;0.000105;0.983102;0.002203;0;0.999865;0.999905;0 >0.001666;0.999305;0.982954;0.003358;0;0.5;0.378246;0.983102;0.002203;0.01469;0.999945;0.00002;0.00003;0.999985;0;0.000005;0.975464;0.002224;0;0.999865;0.999855;0 >0.003358;0.999895;0.983102;0.002203;0;0.268941;0.476268;0.975464;0.002224;0.02231;0.999865;0.00002;0.000115;0.999905;0;0.00009;0.969589;0.000018;0;0.998255;0.998425;0.00001 >0.002203;0.999985;0.975464;0.002224;0;0.880797;0.693174;0.969589;0.000018;0.030394;0.999865;0.000025;0.000115;0.999855;0;0.000145;0.94434;0.002803;0.00001;0.99444;0.993315;0.000145 >0.002224;0.999905;0.969589;0.000018;0;0.5;0.529217;0.94434;0.002803;0.052856;0.998255;0.000025;0.001715;0.998425;0.00001;0.001565;0.933727;0.006593;0.000145;0.99246;0.961665;0.0002 >0.000018;0.999855;0.94434;0.002803;0.00001;0.047426;0.41096;0.933727;0.006593;0.059682;0.99444;0.000025;0.005535;0.993315;0.000145;0.006545;0.911364;0.002948;0.0002;0.495955;0.625965;0.00081 >0.002803;0.998425;0.933727;0.006593;0.000145;0.880797;0.443246;0.911364;0.002948;0.085688;0.99246;0.00002;0.00752;0.961665;0.0002;0.038135;0.53995;0.000964;0.00081;0.000715;0.021055;0.000915 >0.000021;0.977485;0.955565;0.000027;0.00002;0.731059;0.643136;0.981244;0.000015;0.018738;0.996045;0.00001;0.00394;0.998985;0;0.00101;0.980455;0.000014;0;0.997085;0.999905;0 >0.000027;0.98628;0.981244;0.000015;0;0.268941;0.287614;0.980455;0.000014;0.019531;0.997015;0.00001;0.002975;0.999835;0;0.000165;0.977148;0.000026;0;0.9991;0.99988;0 
>0.000015;0.998985;0.980455;0.000014;0;0.268941;0.61822;0.977148;0.000026;0.022826;0.997085;0.000045;0.002875;0.999905;0;0.00009;0.985975;0.000036;0;0.998265;0.99845;0 >0.000014;0.999835;0.977148;0.000026;0;0.880797;0.532952;0.985975;0.000036;0.013989;0.9991;0.000075;0.000825;0.99988;0;0.00012;0.979153;0.000048;0;0.99201;0.995045;0 >0.000026;0.999905;0.985975;0.000036;0;0.119203;0.195761;0.979153;0.000048;0.020799;0.998265;0.00011;0.001625;0.99845;0;0.00155;0.965073;0.003846;0;0.962435;0.974465;0.000075 >0.000036;0.99988;0.979153;0.000048;0;0.047426;0.537927;0.965073;0.003846;0.031079;0.99201;0.000105;0.00788;0.995045;0;0.004955;0.928092;0.000133;0.000075;0.90279;0.723185;0.01667 >0.000048;0.99845;0.965073;0.003846;0;0.268941;0.536435;0.928092;0.000133;0.071776;0.962435;0.00029;0.03727;0.974465;0.000075;0.025465;0.789833;0.006586;0.01667;0.37088;0.27723;0.001685 >0.003846;0.995045;0.928092;0.000133;0.000075;0.880797;0.675902;0.789833;0.006586;0.203578;0.90279;0.003055;0.09415;0.723185;0.01667;0.26014;0.251092;0.006111;0.001685;0.16336;0.06408;0.00334 >0.655616;0.0014;0.028594;0.4833;0.96643;0.731059;0.123467;0.975345;0.001873;0.02278;0.989215;0.00001;0.010775;0.999055;0.00001;0.00093;0.974067;0.00476;0;0.99997;0.999835;0 >0.4833;0.001525;0.975345;0.001873;0.00001;0.006693;0.393649;0.974067;0.00476;0.021172;0.999815;0;0.000185;0.999715;0;0.000285;0.976068;0.007189;0;0.999995;0.999965;0 >0.001873;0.999055;0.974067;0.00476;0;0.006693;0.255213;0.976068;0.007189;0.016743;0.99997;0;0.00003;0.999835;0;0.000165;0.973135;0.001278;0;0.99999;0.999985;0 >0.00476;0.999715;0.976068;0.007189;0;0.5;0.163693;0.973135;0.001278;0.025587;0.999995;0;0.000005;0.999965;0;0.000035;0.97794;0.002564;0;0.99996;0.99998;0 >0.007189;0.999835;0.973135;0.001278;0;0.982014;0.511248;0.97794;0.002564;0.019495;0.99999;0;0.000005;0.999985;0;0.000015;0.978694;0.000011;0;0.999955;0.999965;0 
>_______________________________________________
>scikit-learn mailing list
>scikit-learn at python.org
>https://mail.python.org/mailman/listinfo/scikit-learn

--
*PhD candidate & Research Assistant*
*Cooperative Institute for Research in Environmental Sciences (CIRES)*
*University of Colorado at Boulder*

From ryanmackenzieconway at gmail.com  Thu Sep 14 12:47:22 2017
From: ryanmackenzieconway at gmail.com (Ryan Conway)
Date: Thu, 14 Sep 2017 09:47:22 -0700
Subject: [scikit-learn] Terminating a Pipeline with a NearestNeighbors search
In-Reply-To: 
References: <7cf5de8b-f8de-53e8-5988-77e374f35d14@gmail.com>
	<20170913185347.GE4066674@phare.normalesup.org>
Message-ID: 

Thank you, Andreas. Indeed this becomes cumbersome when we don't know the prototype of the terminating function.

> it's pretty easy to implement this by creating your own Pipeline subclass, isn't it?

Good idea, that's probably the route I will take. That said, as a newcomer to sklearn, a benefit of utility classes such as Pipeline is that their interface helps me understand the library developers' intent and how its components should fit together.
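A minimal sketch of that subclass route, for reference. It assumes every intermediate step is a fitted transformer and the final step exposes ``kneighbors`` (as ``NearestNeighbors`` does); the ``KNeighborsPipeline`` name and its extra method are illustrative, not scikit-learn API:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


class KNeighborsPipeline(Pipeline):
    """A Pipeline whose final step is a NearestNeighbors-style estimator."""

    def kneighbors(self, X, n_neighbors=None):
        # Push X through every transformer, then hand the
        # non-standard method off to the final step.
        for _, step in self.steps[:-1]:
            X = step.transform(X)
        return self.steps[-1][1].kneighbors(X, n_neighbors=n_neighbors)


X = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [10.0, 10.0]])
pipe = KNeighborsPipeline([("scale", StandardScaler()),
                           ("nn", NearestNeighbors(n_neighbors=2))])
pipe.fit(X)
distances, indices = pipe.kneighbors(X)  # each of shape (4, 2)
```

Because the class inherits from Pipeline and adds no constructor parameters, ``fit``, cloning, and grid search keep working; the one extra method simply transforms the input and delegates to the final estimator.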
Prior to this conversation I lacked confidence that Pipeline was suitable for my use case.

Ryan

On Wed, Sep 13, 2017 at 2:14 PM, Joel Nothman wrote:

> it's pretty easy to implement this by creating your own Pipeline subclass, isn't it?
>
> On 14 Sep 2017 4:55 am, "Gael Varoquaux" wrote:
>
>> On Wed, Sep 13, 2017 at 02:45:41PM -0400, Andreas Mueller wrote:
>> > We could add a way to call non-standard methods, but I'm not sure that is the right way to go
>> > (like pipeline.custom_method(X, method="kneighbors")). But that assumes that the method signature is X or (X, y).
>> > So I'm not sure if this is generally useful.
>>
>> I don't see why it's useful either. We shouldn't add a method for everything that can be easily coded with a few lines of Python. The nice thing about Python is that it is such an expressive language.
>>
>> Gaël
>> _______________________________________________
>> scikit-learn mailing list
>> scikit-learn at python.org
>> https://mail.python.org/mailman/listinfo/scikit-learn

From t3kcit at gmail.com  Thu Sep 14 16:37:17 2017
From: t3kcit at gmail.com (Andreas Mueller)
Date: Thu, 14 Sep 2017 16:37:17 -0400
Subject: [scikit-learn] New commits mailing list
Message-ID: <69144745-36f0-0eb5-6f8e-85ef12659aae@gmail.com>

Hey all.
I set up scikit-learn-commits at python.org to track commits to the scikit-learn repo.
You can subscribe here:
https://mail.python.org/mm3/mailman3/lists/scikit-learn-commits.python.org/

Cheers,
Andy

From t3kcit at gmail.com  Thu Sep 14 16:41:42 2017
From: t3kcit at gmail.com (Andreas Mueller)
Date: Thu, 14 Sep 2017 16:41:42 -0400
Subject: [scikit-learn] custom loss function
In-Reply-To: 
References: <7e438aef-13b0-fb2d-fa7e-fd8e8db587dd@gmail.com>
Message-ID: 

On 09/13/2017 05:31 PM, Thomas Evangelidis wrote:
> What about the SVM? I use an SVR at the end to combine multiple MLPRegressor predictions using the rbf kernel (the linear kernel is not good for this problem). Can I also implement an SVR with rbf kernel in Tensorflow using my own loss function? So far I have found an example of an SVC with linear kernel in Tensorflow and nothing in Keras. My alternative option would be to train multiple SVRs and find through cross-validation the one that minimizes my custom loss function, but as I said in a previous message, that would be a suboptimal solution because in scikit-learn the SVR minimizes the default loss function.

Depends on what algorithm you want to use. As Frederico said, SVMs are usually solved as a convex optimization problem on an infinite-dimensional kernel space. There is no straightforward way to extend this to arbitrary losses, afaik.
You can always make the kernel transformation explicit with Nystroem and solve a linear regression problem with a custom loss on that.

From nj.yuanli at gmail.com  Thu Sep 14 21:24:20 2017
From: nj.yuanli at gmail.com (L Ali)
Date: Thu, 14 Sep 2017 21:24:20 -0400
Subject: [scikit-learn] Help needed
Message-ID: <59bb2bc5.7433c80a.48fec.ddde@mx.google.com>

Hi guys,

I am totally new to scikit-learn. I am going to submit a pull request to the repository, but I always get the following error message, and I could not find any useful information on Google, so my last hope is our community.

Is there anyone who can give me some advice about this error: ModuleNotFoundError: No module named 'matplotlib'

Thanks so much!
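Going back to the custom-loss thread above: Andreas's Nystroem suggestion — make the kernel feature map explicit, then minimize any loss you like over linear weights — can be sketched as follows. This is a hedged sketch only: the toy data, the ``gamma``/``n_components`` values, and the Huber-like loss are illustrative choices, not part of any scikit-learn API.

```python
import numpy as np
from scipy.optimize import minimize
from sklearn.kernel_approximation import Nystroem

rng = np.random.RandomState(0)
X = rng.uniform(-3.0, 3.0, size=(200, 1))
y = np.sin(X).ravel() + 0.1 * rng.randn(200)

# Make the (approximate) RBF feature map explicit instead of leaving
# it implicit inside an SVR.
feature_map = Nystroem(kernel="rbf", gamma=0.5, n_components=50,
                       random_state=0)
Z = feature_map.fit_transform(X)

def custom_loss(w, Z, y, delta=0.1):
    # Swap in whatever loss you actually care about; this one happens
    # to be Huber-like.
    r = Z @ w - y
    return np.where(np.abs(r) <= delta,
                    0.5 * r ** 2,
                    delta * (np.abs(r) - 0.5 * delta)).mean()

# Fit linear weights on the explicit feature map under the custom loss.
result = minimize(custom_loss, np.zeros(Z.shape[1]), args=(Z, y),
                  method="L-BFGS-B")
y_pred = Z @ result.x
```

The same pattern also covers the model-combination step Thomas describes: stack the MLPRegressor predictions as columns of ``Z`` and minimize the custom loss over the combining weights directly, rather than hoping SVR's epsilon-insensitive loss happens to agree with it.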
Li Yuan

From se.raschka at gmail.com  Thu Sep 14 21:34:40 2017
From: se.raschka at gmail.com (Sebastian Raschka)
Date: Thu, 14 Sep 2017 21:34:40 -0400
Subject: [scikit-learn] Help needed
In-Reply-To: <59bb2bc5.7433c80a.48fec.ddde@mx.google.com>
References: <59bb2bc5.7433c80a.48fec.ddde@mx.google.com>
Message-ID: <66F7C64C-8092-406D-B2C1-A0E3CF814FB3@gmail.com>

Hi, Li,

to me, it looks like you are importing matplotlib in your code, but matplotlib is not being installed on the CI instances that are running the scikit-learn unit tests. Or in other words, the Travis instance is trying to execute an "import matplotlib..." and fails because matplotlib is not installed there. Except for the docs, I think matplotlib code is not being tested in scikit-learn's unit tests (and hence, it's not being installed).

Does your code/contribution require matplotlib or is it just imported "by accident"? If the latter is true, simply removing the matplotlib imports will probably solve the issue; otherwise, I guess discussing the PR via an issue with the main devs might be the way to go.

Best,
Sebastian

> On Sep 14, 2017, at 9:24 PM, L Ali wrote:
>
> Hi guys,
>
> I am totally new to scikit-learn. I am going to submit a pull request to the repository, but I always get the following error message, and I could not find any useful information on Google, so my last hope is our community.
>
> Is there anyone who can give me some advice about this error: ModuleNotFoundError: No module named 'matplotlib'
>
> Thanks so much!
> > > > Li Yuan > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn From nj.yuanli at gmail.com Thu Sep 14 21:42:08 2017 From: nj.yuanli at gmail.com (L Ali) Date: Thu, 14 Sep 2017 21:42:08 -0400 Subject: [scikit-learn] Help needed In-Reply-To: <66F7C64C-8092-406D-B2C1-A0E3CF814FB3@gmail.com> References: <59bb2bc5.7433c80a.48fec.ddde@mx.google.com> <66F7C64C-8092-406D-B2C1-A0E3CF814FB3@gmail.com> Message-ID: <59bb2ff2.9124ed0a.d1875.d049@mx.google.com> Hi Sebastian, Thanks for your quick response. There are two functions in my code that will output a chart using matplotlib. Do you know how I can discuss the PR via an issue with the main devs? Sorry for such stupid questions. Thanks again for your advice. Li Yuan From: Sebastian Raschka Sent: Thursday, September 14, 2017 9:36 PM To: Scikit-learn mailing list Subject: Re: [scikit-learn] Help needed Hi, Li, to me, it looks like you are importing matplotlib in your code, but matplotlib is not being installed on the CI instances that are running the scikit-learn unit tests. Or in other words, the Travis instance is trying to execute an "import matplotlib..." and fails because matplotlib is not installed there. Except for the docs, I think matplotlib code is not being tested in scikit-learn's unit tests (and hence, it's not being installed). Does your code/contribution require matplotlib or is it just imported "by accident"? If the latter is true, simply removing matplotlib imports will prob. solve the issue; otherwise, I guess discussing the PR via an issue with the main devs might be the way to go. Best, Sebastian > On Sep 14, 2017, at 9:24 PM, L Ali wrote: > > Hi guys, > > I am totally new to the scikit-learn, I am going to submit a pull request to the repository, but always got following error message, I could not find any usefully information from Google, my last hope is our community.
> > Is there anyone can give me some advise about this error: ModuleNotFoundError: No module named 'matplotlib' > > Thanks so much! > > > > Li Yuan > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn _______________________________________________ scikit-learn mailing list scikit-learn at python.org https://mail.python.org/mailman/listinfo/scikit-learn -------------- next part -------------- An HTML attachment was scrubbed... URL: From se.raschka at gmail.com Thu Sep 14 22:49:12 2017 From: se.raschka at gmail.com (Sebastian Raschka) Date: Thu, 14 Sep 2017 22:49:12 -0400 Subject: [scikit-learn] Help needed In-Reply-To: <59bb2ff2.9124ed0a.d1875.d049@mx.google.com> References: <59bb2bc5.7433c80a.48fec.ddde@mx.google.com> <66F7C64C-8092-406D-B2C1-A0E3CF814FB3@gmail.com> <59bb2ff2.9124ed0a.d1875.d049@mx.google.com> Message-ID: <45EB7023-2715-4AD8-A8FC-B5F25026E53C@gmail.com> Honestly not sure what the core dev's preference is, but maybe just submit it as a PR and take the discussion (for a potential removal, inclusion, or move of these features to the documentation) of the additional plotting features from there. Best, Sebastian > On Sep 14, 2017, at 9:42 PM, L Ali wrote: > > Hi Sebastian, > > Thanks for your quick response, there are two functions in my code will output a chart using matplotlib. Do you know how can I discussing the PR via an issue with the main devs? Sorry for such stupid questions. > > Thanks again for your advise. > > Li Yuan > > From: Sebastian Raschka > Sent: Thursday, September 14, 2017 9:36 PM > To: Scikit-learn mailing list > Subject: Re: [scikit-learn] Help needed > > Hi, Li, > > to me, it looks like you are importing matplotlib in your code, but matplotlib is not being installed on the CI instances that are running the scikit-learn unit tests. Or in other words, the Travis instance is trying to execute an "import matplotlib..." 
and fails because matplotlib is not installed there. Except for the docs, I think matplotlib code is not being tested in scikit-learn's unit tests (and hence, it's not being installed). Does your code/contribution require matplotlib or is it just imported "by accident"? If the latter is true, simply removing matplotlib imports will prob. solve the issue; otherwise, I guess discussing the PR via an issue with the main devs might be the way to go. > > Best, > Sebastian > > > On Sep 14, 2017, at 9:24 PM, L Ali wrote: > > > > Hi guys, > > > > I am totally new to the scikit-learn, I am going to submit a pull request to the repository, but always got following error message, I could not find any usefully information from Google, my last hope is our community. > > > > Is there anyone can give me some advise about this error: ModuleNotFoundError: No module named 'matplotlib' > > > > Thanks so much! > > > > > > > > Li Yuan > > > > _______________________________________________ > > scikit-learn mailing list > > scikit-learn at python.org > > https://mail.python.org/mailman/listinfo/scikit-learn > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn From t3kcit at gmail.com Fri Sep 15 10:20:53 2017 From: t3kcit at gmail.com (Andreas Mueller) Date: Fri, 15 Sep 2017 10:20:53 -0400 Subject: [scikit-learn] Help needed In-Reply-To: <59bb2bc5.7433c80a.48fec.ddde@mx.google.com> References: <59bb2bc5.7433c80a.48fec.ddde@mx.google.com> Message-ID: I think you already submitted a PR, right? The PR is definitely the right place to discuss this. Can I ask what you googled that didn't yield results?
Because I get these instructions: https://matplotlib.org/faq/installing_faq.html On 09/14/2017 09:24 PM, L Ali wrote: > > Hi guys, > > I am totally new to the scikit-learn, I am going to submit a pull > request to the repository, but always got following error message, I > could not find any usefully information from Google, my last hope is > our community. > > Is there anyone can give me some advise about this error: > *ModuleNotFoundError: No module named 'matplotlib'* > > Thanks so much! > > Li Yuan > > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn -------------- next part -------------- An HTML attachment was scrubbed... URL: From joel.nothman at gmail.com Sat Sep 16 06:53:52 2017 From: joel.nothman at gmail.com (Joel Nothman) Date: Sat, 16 Sep 2017 20:53:52 +1000 Subject: [scikit-learn] Accessing Clustering Feature Tree in Birch In-Reply-To: References: Message-ID: There is no such thing as "the data samples in this cluster". The point of Birch being online is that it loses any reference to the individual samples that contributed to each node, but stores some statistics on their basis. Roman Yurchak has, however, offered a PR where, for the non-online case, storage of the indices contributing to each node can be optionally turned on: https://github.com/scikit-learn/scikit-learn/pull/8808 As for finding what is contained under any particular node, traversing the tree is a fairly basic task from a computer science perspective. Before we were to support something to make this much easier, I think we'd need to be clear on what kinds of use case we were supporting. What do you hope to do with this information, and what would a function interface look like that would make this much easier? 
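For illustration, a minimal traversal along the lines described above might look as follows. Note this is a sketch, not an official interface: `root_`, `subclusters_`, `child_`, `n_samples_`, and `centroid_` are private CF-tree internals of scikit-learn's Birch and may change between versions.

```python
import numpy as np
from sklearn.cluster import Birch

# Build a small CF tree; n_clusters=None skips the global clustering step.
X = np.random.RandomState(0).rand(200, 2)
brc = Birch(threshold=0.1, n_clusters=None).fit(X)

def walk(node, depth=0):
    # Print each CF subcluster's sample count and centroid,
    # then recurse into its child node, if it has one.
    for sc in node.subclusters_:
        print("  " * depth, sc.n_samples_, sc.centroid_)
        if sc.child_ is not None:
            walk(sc.child_, depth + 1)

walk(brc.root_)
```

The per-node statistics (sample counts, linear sums, centroids) are all that Birch retains; recovering the indices of the contributing samples requires something like the PR mentioned above.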
Decimals aren't a practical option, as the branching factor may be greater than 10; it is a hard structure to inspect, and susceptible to computational imprecision. Better off with a list of tuples, but what use of them is not easy enough to do now? -------------- next part -------------- An HTML attachment was scrubbed... URL: From joel.nothman at gmail.com Sat Sep 16 07:00:00 2017 From: joel.nothman at gmail.com (Joel Nothman) Date: Sat, 16 Sep 2017 21:00:00 +1000 Subject: [scikit-learn] Terminating a Pipeline with a NearestNeighbors search In-Reply-To: References: <7cf5de8b-f8de-53e8-5988-77e374f35d14@gmail.com> <20170913185347.GE4066674@phare.normalesup.org> Message-ID: Pipelines are useful for creating composite estimators that can be plugged in elsewhere. At the moment we have nowhere you can plug in a neighborhood calculator, although we would like to (see https://github.com/scikit-learn/scikit-learn/pull/8999 for the latest attempt). If there are then compelling use-cases for allowing that pluggable neighbors estimator to be a pipeline, we might follow that path. But as you implied, we don't want to confuse users if such use-cases are far-fetched. What do you intend to use it for? On 15 September 2017 at 02:47, Ryan Conway wrote: > Thank you, Andreas. Indeed this becomes cumbersome when we don't know the > prototype of the terminating function. > > > it's pretty easy to implement this by creating your own Pipeline > subclass, isn't it? > > Good idea, that's probably the route I will take. That said, as a > newcomer to sklearn, a benefit of utility classes such as Pipeline is that > their interface helps me understand the library developers' intent and how > its components should fit together. Prior to this conversation I lacked > confidence that Pipeline was suitable for my use case. > > Ryan > > On Wed, Sep 13, 2017 at 2:14 PM, Joel Nothman > wrote: > >> it's pretty easy to implement this by creating your own Pipeline >> subclass, isn't it?
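A sketch of what such a subclass could look like (hypothetical, not part of scikit-learn's API; it assumes the final pipeline step exposes `kneighbors`, as `NearestNeighbors` does):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

class KNeighborsPipeline(Pipeline):
    """Pipeline whose final step answers nearest-neighbor queries."""
    def kneighbors(self, X, **kwargs):
        # Push X through every transformer, then query the final step.
        for name, transform in self.steps[:-1]:
            X = transform.transform(X)
        return self.steps[-1][1].kneighbors(X, **kwargs)

X = np.random.RandomState(0).rand(20, 3)
pipe = KNeighborsPipeline([("scale", StandardScaler()),
                           ("nn", NearestNeighbors(n_neighbors=3))])
pipe.fit(X)
# Querying with training points: each point's nearest neighbor is itself.
dist, ind = pipe.kneighbors(X[:2])
```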
>> >> On 14 Sep 2017 4:55 am, "Gael Varoquaux" >> wrote: >> >>> On Wed, Sep 13, 2017 at 02:45:41PM -0400, Andreas Mueller wrote: >>> > We could add a way to call non-standard methods, but I'm not sure that >>> is the >>> > right way to go. >>> > (like pipeline.custom_method(X, method="kneighbors")). But that >>> assumes that >>> > the method signature is X or (X, y). >>> > So I'm not sure if this is generally useful. >>> >>> I don't see either why it's useful. We shouldn't add a method for >>> everything that can be easily coded with a few lines of Python. The nice >>> thing of Python is that it is such an expressive language. >>> >>> Gaël >>> _______________________________________________ >>> scikit-learn mailing list >>> scikit-learn at python.org >>> https://mail.python.org/mailman/listinfo/scikit-learn >>> >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From chyikwei.yau at gmail.com Sun Sep 17 19:52:51 2017 From: chyikwei.yau at gmail.com (chyi-kwei yau) Date: Sun, 17 Sep 2017 23:52:51 +0000 Subject: [scikit-learn] LatentDirichletAllocation failing to find topics in NLTK Gutenberg corpus? In-Reply-To: References: Message-ID: Hi Markus, I tried your code, and I think the issue might be that there are only 18 docs in the Gutenberg corpus. If you print out the transformed doc-topic distributions, you will see that a lot of topics are not used. And since there are no words assigned to those topics, the weights will be equal to the `topic_word_prior` parameter. You can print out the transformed doc-topic distributions like this: ------------- >>> doc_distr = lda.fit_transform(tf) >>> for d in doc_distr: ...
print np.where(d > 0.001)[0] ... [17 27] [17 27] [17 27 28] [14] [ 2 4 28] [ 2 4 15 21 27 28] [1] [ 1 2 17 21 27 28] [ 2 15 17 22 28] [ 2 17 21 22 27 28] [ 2 15 17 28] [ 2 17 21 27 28] [ 2 14 15 17 21 22 27 28] [15 22] [ 8 11] [8] [ 8 24] [ 2 14 15 22] and my full test scripts are here: https://gist.github.com/chyikwei/1707b59e009d381e1ce1e7a38f9c7826 Best, Chyi-Kwei On Thu, Sep 14, 2017 at 7:26 AM Markus Konrad wrote: > Hi there, > > I'm trying out sklearn's latent Dirichlet allocation implementation for > topic modeling. The code from the official example [1] works just fine and > the extracted topics look reasonable. However, when I try other corpora, > for example the Gutenberg corpus from NLTK, most of the extracted topics > are garbage. See this example output, when trying to get 30 topics: > > Topic #0: zoological (0.01) fathoms (0.01) fatigued (0.01) fatigues (0.01) > fatiguing (0.01) > Topic #1: mr (1081.61) emma (866.01) miss (506.94) mrs (445.56) jane > (301.83) > Topic #2: zoological (0.01) fathoms (0.01) fatigued (0.01) fatigues (0.01) > fatiguing (0.01) > Topic #3: thee (82.64) thou (70.0) thy (66.66) father (56.45) mother > (55.27) > Topic #4: anne (498.74) captain (303.01) lady (173.96) mr (172.07) charles > (166.21) > Topic #5: zoological (0.01) fathoms (0.01) fatigued (0.01) fatigues (0.01) > fatiguing (0.01) > Topic #6: zoological (0.01) fathoms (0.01) fatigued (0.01) fatigues (0.01) > fatiguing (0.01) > Topic #7: zoological (0.01) fathoms (0.01) fatigued (0.01) fatigues (0.01) > fatiguing (0.01) > ... > > Many topics tend to have the same weights, all equal to the > `topic_word_prior` parameter. 
> > This is my script: > > import nltk > from sklearn.feature_extraction.text import CountVectorizer > from sklearn.decomposition import LatentDirichletAllocation > > def print_top_words(model, feature_names, n_top_words): > for topic_idx, topic in enumerate(model.components_): > message = "Topic #%d: " % topic_idx > message += " ".join([feature_names[i] + " (" + str(round(topic[i], > 2)) + ")" > for i in topic.argsort()[:-n_top_words - > 1:-1]]) > print(message) > > > data_samples = [nltk.corpus.gutenberg.raw(f_id) > for f_id in nltk.corpus.gutenberg.fileids()] > > tf_vectorizer = CountVectorizer(max_df=0.95, min_df=2, > stop_words='english') > tf = tf_vectorizer.fit_transform(data_samples) > > lda = LatentDirichletAllocation(n_components=30, > learning_method='batch', > n_jobs=-1, # all CPUs > verbose=1, > evaluate_every=10, > max_iter=1000, > doc_topic_prior=0.1, > topic_word_prior=0.01, > random_state=1) > > lda.fit(tf) > tf_feature_names = tf_vectorizer.get_feature_names() > print_top_words(lda, tf_feature_names, 5) > > Is there a problem in how I set up the LatentDirichletAllocation instance > or pass the data? I tried out different parameter settings, but none of > them provided good results for that corpus. I also tried out alternative > implementations (like the lda package [2]) and those were able to find > reasonable topics. > > Best, > Markus > > > [1] > http://scikit-learn.org/stable/auto_examples/applications/plot_topics_extraction_with_nmf_lda.html#sphx-glr-auto-examples-applications-plot-topics-extraction-with-nmf-lda-py > [2] http://pythonhosted.org/lda/ > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From markus.konrad at wzb.eu Mon Sep 18 12:26:35 2017 From: markus.konrad at wzb.eu (Markus Konrad) Date: Mon, 18 Sep 2017 18:26:35 +0200 Subject: [scikit-learn] LatentDirichletAllocation failing to find topics in NLTK Gutenberg corpus? In-Reply-To: References: Message-ID: Hi Chyi-Kwei, thanks for digging into this. I made similar observations with Gensim when using only a small number of (big) documents. Gensim also uses the Online Variational Bayes approach (Hoffman et al.). So could it be that the Hoffman et al. method is problematic in such scenarios? I found that Gibbs sampling based implementations provide much more informative topics in this case. If this was the case, then if I'd slice the documents in some way (say every N paragraphs become a "document") then I should get better results with scikit-learn and Gensim, right? I think I'll try this out tomorrow. Best, Markus > Date: Sun, 17 Sep 2017 23:52:51 +0000 > From: chyi-kwei yau > To: Scikit-learn mailing list > Subject: Re: [scikit-learn] LatentDirichletAllocation failing to find > topics in NLTK Gutenberg corpus? > Message-ID: > > Content-Type: text/plain; charset="utf-8" > > Hi Markus, > > I tried your code and find the issue might be there are only 18 docs > in the Gutenberg > corpus. > if you print out transformed doc topic distribution, you will see a lot of > topics are not used. > And since there is no words assigned to those topics, the weights will be > equal to`topic_word_prior` parameter. > > You can print out the transformed doc topic distributions like this: > ------------- >>>> doc_distr = lda.fit_transform(tf) > >>>> for d in doc_distr: > ... print np.where(d > 0.001)[0] > ... 
> [17 27] > [17 27] > [17 27 28] > [14] > [ 2 4 28] > [ 2 4 15 21 27 28] > [1] > [ 1 2 17 21 27 28] > [ 2 15 17 22 28] > [ 2 17 21 22 27 28] > [ 2 15 17 28] > [ 2 17 21 27 28] > [ 2 14 15 17 21 22 27 28] > [15 22] > [ 8 11] > [8] > [ 8 24] > [ 2 14 15 22] > > and my full test scripts are here: > https://gist.github.com/chyikwei/1707b59e009d381e1ce1e7a38f9c7826 > > Best, > Chyi-Kwei > > > On Thu, Sep 14, 2017 at 7:26 AM Markus Konrad wrote: > >> Hi there, >> >> I'm trying out sklearn's latent Dirichlet allocation implementation for >> topic modeling. The code from the official example [1] works just fine and >> the extracted topics look reasonable. However, when I try other corpora, >> for example the Gutenberg corpus from NLTK, most of the extracted topics >> are garbage. See this example output, when trying to get 30 topics: >> >> Topic #0: zoological (0.01) fathoms (0.01) fatigued (0.01) fatigues (0.01) >> fatiguing (0.01) >> Topic #1: mr (1081.61) emma (866.01) miss (506.94) mrs (445.56) jane >> (301.83) >> Topic #2: zoological (0.01) fathoms (0.01) fatigued (0.01) fatigues (0.01) >> fatiguing (0.01) >> Topic #3: thee (82.64) thou (70.0) thy (66.66) father (56.45) mother >> (55.27) >> Topic #4: anne (498.74) captain (303.01) lady (173.96) mr (172.07) charles >> (166.21) >> Topic #5: zoological (0.01) fathoms (0.01) fatigued (0.01) fatigues (0.01) >> fatiguing (0.01) >> Topic #6: zoological (0.01) fathoms (0.01) fatigued (0.01) fatigues (0.01) >> fatiguing (0.01) >> Topic #7: zoological (0.01) fathoms (0.01) fatigued (0.01) fatigues (0.01) >> fatiguing (0.01) >> ... >> >> Many topics tend to have the same weights, all equal to the >> `topic_word_prior` parameter. 
>> >> This is my script: >> >> import nltk >> from sklearn.feature_extraction.text import CountVectorizer >> from sklearn.decomposition import LatentDirichletAllocation >> >> def print_top_words(model, feature_names, n_top_words): >> for topic_idx, topic in enumerate(model.components_): >> message = "Topic #%d: " % topic_idx >> message += " ".join([feature_names[i] + " (" + str(round(topic[i], >> 2)) + ")" >> for i in topic.argsort()[:-n_top_words - >> 1:-1]]) >> print(message) >> >> >> data_samples = [nltk.corpus.gutenberg.raw(f_id) >> for f_id in nltk.corpus.gutenberg.fileids()] >> >> tf_vectorizer = CountVectorizer(max_df=0.95, min_df=2, >> stop_words='english') >> tf = tf_vectorizer.fit_transform(data_samples) >> >> lda = LatentDirichletAllocation(n_components=30, >> learning_method='batch', >> n_jobs=-1, # all CPUs >> verbose=1, >> evaluate_every=10, >> max_iter=1000, >> doc_topic_prior=0.1, >> topic_word_prior=0.01, >> random_state=1) >> >> lda.fit(tf) >> tf_feature_names = tf_vectorizer.get_feature_names() >> print_top_words(lda, tf_feature_names, 5) >> >> Is there a problem in how I set up the LatentDirichletAllocation instance >> or pass the data? I tried out different parameter settings, but none of >> them provided good results for that corpus. I also tried out alternative >> implementations (like the lda package [2]) and those were able to find >> reasonable topics. >> >> Best, >> Markus >> >> >> [1] >> http://scikit-learn.org/stable/auto_examples/applications/plot_topics_extraction_with_nmf_lda.html#sphx-glr-auto-examples-applications-plot-topics-extraction-with-nmf-lda-py >> [2] http://pythonhosted.org/lda/ From t3kcit at gmail.com Mon Sep 18 12:59:44 2017 From: t3kcit at gmail.com (Andreas Mueller) Date: Mon, 18 Sep 2017 12:59:44 -0400 Subject: [scikit-learn] LatentDirichletAllocation failing to find topics in NLTK Gutenberg corpus? 
In-Reply-To: References: Message-ID: <8b6f7eda-0152-41bd-d44d-16d870ffcbe6@gmail.com> For very few documents, Gibbs sampling is likely to work better - or rather, Gibbs sampling usually works better given enough runtime, and for so few documents, runtime is not an issue. The length of the documents doesn't matter, only the size of the vocabulary. Also, hyperparameter choices might need to be different for Gibbs sampling vs variational inference. On 09/18/2017 12:26 PM, Markus Konrad wrote: > Hi Chyi-Kwei, > > thanks for digging into this. I made similar observations with Gensim > when using only a small number of (big) documents. Gensim also uses the > Online Variational Bayes approach (Hoffman et al.). So could it be that > the Hoffman et al. method is problematic in such scenarios? I found that > Gibbs sampling based implementations provide much more informative > topics in this case. > > If this was the case, then if I'd slice the documents in some way (say > every N paragraphs become a "document") then I should get better results > with scikit-learn and Gensim, right? I think I'll try this out tomorrow. > > Best, > Markus > > > >> Date: Sun, 17 Sep 2017 23:52:51 +0000 >> From: chyi-kwei yau >> To: Scikit-learn mailing list >> Subject: Re: [scikit-learn] LatentDirichletAllocation failing to find >> topics in NLTK Gutenberg corpus? >> Message-ID: >> >> Content-Type: text/plain; charset="utf-8" >> >> Hi Markus, >> >> I tried your code and find the issue might be there are only 18 docs >> in the Gutenberg >> corpus. >> if you print out transformed doc topic distribution, you will see a lot of >> topics are not used. >> And since there is no words assigned to those topics, the weights will be >> equal to`topic_word_prior` parameter. >> >> You can print out the transformed doc topic distributions like this: >> ------------- >>>>> doc_distr = lda.fit_transform(tf) >>>>> for d in doc_distr: >> ... print np.where(d > 0.001)[0] >> ...
>> [17 27] >> [17 27] >> [17 27 28] >> [14] >> [ 2 4 28] >> [ 2 4 15 21 27 28] >> [1] >> [ 1 2 17 21 27 28] >> [ 2 15 17 22 28] >> [ 2 17 21 22 27 28] >> [ 2 15 17 28] >> [ 2 17 21 27 28] >> [ 2 14 15 17 21 22 27 28] >> [15 22] >> [ 8 11] >> [8] >> [ 8 24] >> [ 2 14 15 22] >> >> and my full test scripts are here: >> https://gist.github.com/chyikwei/1707b59e009d381e1ce1e7a38f9c7826 >> >> Best, >> Chyi-Kwei >> >> >> On Thu, Sep 14, 2017 at 7:26 AM Markus Konrad wrote: >> >>> Hi there, >>> >>> I'm trying out sklearn's latent Dirichlet allocation implementation for >>> topic modeling. The code from the official example [1] works just fine and >>> the extracted topics look reasonable. However, when I try other corpora, >>> for example the Gutenberg corpus from NLTK, most of the extracted topics >>> are garbage. See this example output, when trying to get 30 topics: >>> >>> Topic #0: zoological (0.01) fathoms (0.01) fatigued (0.01) fatigues (0.01) >>> fatiguing (0.01) >>> Topic #1: mr (1081.61) emma (866.01) miss (506.94) mrs (445.56) jane >>> (301.83) >>> Topic #2: zoological (0.01) fathoms (0.01) fatigued (0.01) fatigues (0.01) >>> fatiguing (0.01) >>> Topic #3: thee (82.64) thou (70.0) thy (66.66) father (56.45) mother >>> (55.27) >>> Topic #4: anne (498.74) captain (303.01) lady (173.96) mr (172.07) charles >>> (166.21) >>> Topic #5: zoological (0.01) fathoms (0.01) fatigued (0.01) fatigues (0.01) >>> fatiguing (0.01) >>> Topic #6: zoological (0.01) fathoms (0.01) fatigued (0.01) fatigues (0.01) >>> fatiguing (0.01) >>> Topic #7: zoological (0.01) fathoms (0.01) fatigued (0.01) fatigues (0.01) >>> fatiguing (0.01) >>> ... >>> >>> Many topics tend to have the same weights, all equal to the >>> `topic_word_prior` parameter. 
>>> >>> This is my script: >>> >>> import nltk >>> from sklearn.feature_extraction.text import CountVectorizer >>> from sklearn.decomposition import LatentDirichletAllocation >>> >>> def print_top_words(model, feature_names, n_top_words): >>> for topic_idx, topic in enumerate(model.components_): >>> message = "Topic #%d: " % topic_idx >>> message += " ".join([feature_names[i] + " (" + str(round(topic[i], >>> 2)) + ")" >>> for i in topic.argsort()[:-n_top_words - >>> 1:-1]]) >>> print(message) >>> >>> >>> data_samples = [nltk.corpus.gutenberg.raw(f_id) >>> for f_id in nltk.corpus.gutenberg.fileids()] >>> >>> tf_vectorizer = CountVectorizer(max_df=0.95, min_df=2, >>> stop_words='english') >>> tf = tf_vectorizer.fit_transform(data_samples) >>> >>> lda = LatentDirichletAllocation(n_components=30, >>> learning_method='batch', >>> n_jobs=-1, # all CPUs >>> verbose=1, >>> evaluate_every=10, >>> max_iter=1000, >>> doc_topic_prior=0.1, >>> topic_word_prior=0.01, >>> random_state=1) >>> >>> lda.fit(tf) >>> tf_feature_names = tf_vectorizer.get_feature_names() >>> print_top_words(lda, tf_feature_names, 5) >>> >>> Is there a problem in how I set up the LatentDirichletAllocation instance >>> or pass the data? I tried out different parameter settings, but none of >>> them provided good results for that corpus. I also tried out alternative >>> implementations (like the lda package [2]) and those were able to find >>> reasonable topics. 
>>> >>> Best, >>> Markus >>> >>> >>> [1] >>> http://scikit-learn.org/stable/auto_examples/applications/plot_topics_extraction_with_nmf_lda.html#sphx-glr-auto-examples-applications-plot-topics-extraction-with-nmf-lda-py >>> [2] http://pythonhosted.org/lda/ > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn From markus.konrad at wzb.eu Tue Sep 19 04:26:41 2017 From: markus.konrad at wzb.eu (Markus Konrad) Date: Tue, 19 Sep 2017 10:26:41 +0200 Subject: [scikit-learn] LatentDirichletAllocation failing to find topics in NLTK Gutenberg corpus? In-Reply-To: <8b6f7eda-0152-41bd-d44d-16d870ffcbe6@gmail.com> References: <8b6f7eda-0152-41bd-d44d-16d870ffcbe6@gmail.com> Message-ID: This is indeed interesting. I didn't know that there are such big differences between these approaches. I split the 18 documents into sub-documents of 5 paragraphs each, so that I got around 10k of these sub-documents. Now, scikit-learn and gensim deliver much better results, quite similar to those from a Gibbs-sampling-based implementation. So it was basically the same data, just split in a different way. I think the disadvantages/limits of the Variational Bayes approach should be mentioned in the documentation. Best, Markus On 09/18/2017 06:59 PM, Andreas Mueller wrote: > For very few documents, Gibbs sampling is likely to work better - or > rather, Gibbs sampling usually works > better given enough runtime, and for so few documents, runtime is not an > issue. > The length of the documents don't matter, only the size of the vocabulary. > Also, hyper parameter choices might need to be different for Gibbs > sampling vs variational inference. > > On 09/18/2017 12:26 PM, Markus Konrad wrote: >> Hi Chyi-Kwei, >> >> thanks for digging into this. I made similar observations with Gensim >> when using only a small number of (big) documents.
Gensim also uses the >> Online Variational Bayes approach (Hoffman et al.). So could it be that >> the Hoffman et al. method is problematic in such scenarios? I found that >> Gibbs sampling based implementations provide much more informative >> topics in this case. >> >> If this was the case, then if I'd slice the documents in some way (say >> every N paragraphs become a "document") then I should get better results >> with scikit-learn and Gensim, right? I think I'll try this out tomorrow. >> >> Best, >> Markus >> >> >> >>> Date: Sun, 17 Sep 2017 23:52:51 +0000 >>> From: chyi-kwei yau >>> To: Scikit-learn mailing list >>> Subject: Re: [scikit-learn] LatentDirichletAllocation failing to find >>> topics in NLTK Gutenberg corpus? >>> Message-ID: >>> >>> Content-Type: text/plain; charset="utf-8" >>> >>> Hi Markus, >>> >>> I tried your code and find the issue might be there are only 18 docs >>> in the Gutenberg >>> corpus. >>> if you print out transformed doc topic distribution, you will see a >>> lot of >>> topics are not used. >>> And since there is no words assigned to those topics, the weights >>> will be >>> equal to`topic_word_prior` parameter. >>> >>> You can print out the transformed doc topic distributions like this: >>> ------------- >>>>>> doc_distr = lda.fit_transform(tf) >>>>>> for d in doc_distr: >>> ...     print np.where(d > 0.001)[0] >>> ... >>> [17 27] >>> [17 27] >>> [17 27 28] >>> [14] >>> [ 2  4 28] >>> [ 2  4 15 21 27 28] >>> [1] >>> [ 1  2 17 21 27 28] >>> [ 2 15 17 22 28] >>> [ 2 17 21 22 27 28] >>> [ 2 15 17 28] >>> [ 2 17 21 27 28] >>> [ 2 14 15 17 21 22 27 28] >>> [15 22] >>> [ 8 11] >>> [8] >>> [ 8 24] >>> [ 2 14 15 22] >>> >>> and my full test scripts are here: >>> https://gist.github.com/chyikwei/1707b59e009d381e1ce1e7a38f9c7826 >>> >>> Best, >>> Chyi-Kwei >>> >>> >>> On Thu, Sep 14, 2017 at 7:26 AM Markus Konrad >>> wrote: >>> >>>> Hi there, >>>> >>>> I'm trying out sklearn's latent Dirichlet allocation implementation for >>>> topic modeling. The code from the official example [1] works just >>>> fine and >>>> the extracted topics look reasonable. However, when I try other >>>> corpora, >>>> for example the Gutenberg corpus from NLTK, most of the extracted >>>> topics >>>> are garbage. See this example output, when trying to get 30 topics: >>>> >>>> Topic #0: zoological (0.01) fathoms (0.01) fatigued (0.01) fatigues >>>> (0.01) >>>> fatiguing (0.01) >>>> Topic #1: mr (1081.61) emma (866.01) miss (506.94) mrs (445.56) jane >>>> (301.83) >>>> Topic #2: zoological (0.01) fathoms (0.01) fatigued (0.01) fatigues >>>> (0.01) >>>> fatiguing (0.01) >>>> Topic #3: thee (82.64) thou (70.0) thy (66.66) father (56.45) mother >>>> (55.27) >>>> Topic #4: anne (498.74) captain (303.01) lady (173.96) mr (172.07) >>>> charles >>>> (166.21) >>>> Topic #5: zoological (0.01) fathoms (0.01) fatigued (0.01) fatigues >>>> (0.01) >>>> fatiguing (0.01) >>>> Topic #6: zoological (0.01) fathoms (0.01) fatigued (0.01) fatigues >>>> (0.01) >>>> fatiguing (0.01) >>>> Topic #7: zoological (0.01) fathoms (0.01) fatigued (0.01) fatigues >>>> (0.01) >>>> fatiguing (0.01) >>>> ... >>>> >>>> Many topics tend to have the same weights, all equal to the >>>> `topic_word_prior` parameter. >>>> >>>> This is my script: >>>> >>>> import nltk >>>> from sklearn.feature_extraction.text import CountVectorizer >>>> from sklearn.decomposition import LatentDirichletAllocation >>>> >>>> def print_top_words(model, feature_names, n_top_words): >>>>     for topic_idx, topic in enumerate(model.components_): >>>>         message = "Topic #%d: " % topic_idx >>>>         message += " ".join([feature_names[i] + " (" + str(round(topic[i], >>>> 2)) + ")" >>>>                              for i in topic.argsort()[:-n_top_words - >>>> 1:-1]]) >>>>         print(message) >>>> >>>> >>>> data_samples = [nltk.corpus.gutenberg.raw(f_id) >>>>                 for f_id in nltk.corpus.gutenberg.fileids()] >>>> >>>> tf_vectorizer = CountVectorizer(max_df=0.95, min_df=2, >>>>                                 stop_words='english') >>>> tf = tf_vectorizer.fit_transform(data_samples) >>>> >>>> lda = LatentDirichletAllocation(n_components=30, >>>>                                 learning_method='batch', >>>>                                 n_jobs=-1,  # all CPUs >>>>                                 verbose=1, >>>>                                 evaluate_every=10, >>>>                                 max_iter=1000, >>>>                                 doc_topic_prior=0.1, >>>>                                 topic_word_prior=0.01, >>>>                                 random_state=1) >>>> >>>> lda.fit(tf) >>>> tf_feature_names = tf_vectorizer.get_feature_names() >>>> print_top_words(lda, tf_feature_names, 5) >>>> >>>> Is there a problem in how I set up the LatentDirichletAllocation >>>> instance >>>> or pass the data? I tried out different parameter settings, but none of >>>> them provided good results for that corpus. I also tried out >>>> alternative >>>> implementations (like the lda package [2]) and those were able to find >>>> reasonable topics.
>>>> >>>> Best, >>>> Markus >>>> >>>> >>>> [1] >>>> http://scikit-learn.org/stable/auto_examples/applications/plot_topics_extraction_with_nmf_lda.html#sphx-glr-auto-examples-applications-plot-topics-extraction-with-nmf-lda-py >>>> >>>> [2] http://pythonhosted.org/lda/ >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn From t3kcit at gmail.com Tue Sep 19 12:07:51 2017 From: t3kcit at gmail.com (Andreas Mueller) Date: Tue, 19 Sep 2017 12:07:51 -0400 Subject: [scikit-learn] LatentDirichletAllocation failing to find topics in NLTK Gutenberg corpus? In-Reply-To: References: <8b6f7eda-0152-41bd-d44d-16d870ffcbe6@gmail.com> Message-ID: I'm actually surprised the gibbs sampling gave useful results with so little data. And splitting the documents results in very different data. It has a lot more information. How many topics did you use? Also: PR for docs welcome! On 09/19/2017 04:26 AM, Markus Konrad wrote: > This is indeed interesting. I didn't know that there are so big > differences between these approaches. I split the 18 documents into > sub-documents of 5 paragraphs each, so that I got around 10k of these > sub-documents. Now, scikit-learn and gensim deliver much better results, > quite similar to those from a Gibbs sampling based implementation. So it > was basically the same data, just split in a different way. > > I think the disadvantages/limits of the Variational Bayes approach > should be mentioned in the documentation. > > Best, > Markus > > > > On 09/18/2017 06:59 PM, Andreas Mueller wrote: >> For very few documents, Gibbs sampling is likely to work better - or >> rather, Gibbs sampling usually works >> better given enough runtime, and for so few documents, runtime is not an >> issue. 
>> The length of the documents don't matter, only the size of the vocabulary. >> Also, hyper parameter choices might need to be different for Gibbs >> sampling vs variational inference. >> >> On 09/18/2017 12:26 PM, Markus Konrad wrote: >>> Hi Chyi-Kwei, >>> >>> thanks for digging into this. I made similar observations with Gensim >>> when using only a small number of (big) documents. Gensim also uses the >>> Online Variational Bayes approach (Hoffman et al.). So could it be that >>> the Hoffman et al. method is problematic in such scenarios? I found that >>> Gibbs sampling based implementations provide much more informative >>> topics in this case. >>> >>> If this was the case, then if I'd slice the documents in some way (say >>> every N paragraphs become a "document") then I should get better results >>> with scikit-learn and Gensim, right? I think I'll try this out tomorrow. >>> >>> Best, >>> Markus >>> >>> >>> >>>> Date: Sun, 17 Sep 2017 23:52:51 +0000 >>>> From: chyi-kwei yau >>>> To: Scikit-learn mailing list >>>> Subject: Re: [scikit-learn] LatentDirichletAllocation failing to find >>>> ????topics in NLTK Gutenberg corpus? >>>> Message-ID: >>>> ???? >>>> Content-Type: text/plain; charset="utf-8" >>>> >>>> Hi Markus, >>>> >>>> I tried your code and find the issue might be there are only 18 docs >>>> in the Gutenberg >>>> corpus. >>>> if you print out transformed doc topic distribution, you will see a >>>> lot of >>>> topics are not used. >>>> And since there is no words assigned to those topics, the weights >>>> will be >>>> equal to`topic_word_prior` parameter. >>>> >>>> You can print out the transformed doc topic distributions like this: >>>> ------------- >>>>>>> doc_distr = lda.fit_transform(tf) >>>>>>> for d in doc_distr: >>>> ...???? print np.where(d > 0.001)[0] >>>> ... >>>> [17 27] >>>> [17 27] >>>> [17 27 28] >>>> [14] >>>> [ 2? 4 28] >>>> [ 2? 4 15 21 27 28] >>>> [1] >>>> [ 1? 
2 17 21 27 28] >>>> [ 2 15 17 22 28] >>>> [ 2 17 21 22 27 28] >>>> [ 2 15 17 28] >>>> [ 2 17 21 27 28] >>>> [ 2 14 15 17 21 22 27 28] >>>> [15 22] >>>> [ 8 11] >>>> [8] >>>> [ 8 24] >>>> [ 2 14 15 22] >>>> >>>> and my full test scripts are here: >>>> https://gist.github.com/chyikwei/1707b59e009d381e1ce1e7a38f9c7826 >>>> >>>> Best, >>>> Chyi-Kwei >>>> >>>> >>>> On Thu, Sep 14, 2017 at 7:26 AM Markus Konrad >>>> wrote: >>>> >>>>> Hi there, >>>>> >>>>> I'm trying out sklearn's latent Dirichlet allocation implementation for >>>>> topic modeling. The code from the official example [1] works just >>>>> fine and >>>>> the extracted topics look reasonable. However, when I try other >>>>> corpora, >>>>> for example the Gutenberg corpus from NLTK, most of the extracted >>>>> topics >>>>> are garbage. See this example output, when trying to get 30 topics: >>>>> >>>>> Topic #0: zoological (0.01) fathoms (0.01) fatigued (0.01) fatigues >>>>> (0.01) >>>>> fatiguing (0.01) >>>>> Topic #1: mr (1081.61) emma (866.01) miss (506.94) mrs (445.56) jane >>>>> (301.83) >>>>> Topic #2: zoological (0.01) fathoms (0.01) fatigued (0.01) fatigues >>>>> (0.01) >>>>> fatiguing (0.01) >>>>> Topic #3: thee (82.64) thou (70.0) thy (66.66) father (56.45) mother >>>>> (55.27) >>>>> Topic #4: anne (498.74) captain (303.01) lady (173.96) mr (172.07) >>>>> charles >>>>> (166.21) >>>>> Topic #5: zoological (0.01) fathoms (0.01) fatigued (0.01) fatigues >>>>> (0.01) >>>>> fatiguing (0.01) >>>>> Topic #6: zoological (0.01) fathoms (0.01) fatigued (0.01) fatigues >>>>> (0.01) >>>>> fatiguing (0.01) >>>>> Topic #7: zoological (0.01) fathoms (0.01) fatigued (0.01) fatigues >>>>> (0.01) >>>>> fatiguing (0.01) >>>>> ... >>>>> >>>>> Many topics tend to have the same weights, all equal to the >>>>> `topic_word_prior` parameter. 
>>>>> >>>>> This is my script: >>>>> >>>>> import nltk >>>>> from sklearn.feature_extraction.text import CountVectorizer >>>>> from sklearn.decomposition import LatentDirichletAllocation >>>>> >>>>> def print_top_words(model, feature_names, n_top_words): >>>>> ???? for topic_idx, topic in enumerate(model.components_): >>>>> ???????? message = "Topic #%d: " % topic_idx >>>>> ???????? message += " ".join([feature_names[i] + " (" + >>>>> str(round(topic[i], >>>>> 2)) + ")" >>>>> ????????????????????????????? for i in topic.argsort()[:-n_top_words - >>>>> 1:-1]]) >>>>> ???????? print(message) >>>>> >>>>> >>>>> data_samples = [nltk.corpus.gutenberg.raw(f_id) >>>>> ??????????????? for f_id in nltk.corpus.gutenberg.fileids()] >>>>> >>>>> tf_vectorizer = CountVectorizer(max_df=0.95, min_df=2, >>>>> ???????????????????????????????? stop_words='english') >>>>> tf = tf_vectorizer.fit_transform(data_samples) >>>>> >>>>> lda = LatentDirichletAllocation(n_components=30, >>>>> ???????????????????????????????? learning_method='batch', >>>>> ???????????????????????????????? n_jobs=-1,? # all CPUs >>>>> ???????????????????????????????? verbose=1, >>>>> ???????????????????????????????? evaluate_every=10, >>>>> ???????????????????????????????? max_iter=1000, >>>>> ???????????????????????????????? doc_topic_prior=0.1, >>>>> ???????????????????????????????? topic_word_prior=0.01, >>>>> ???????????????????????????????? random_state=1) >>>>> >>>>> lda.fit(tf) >>>>> tf_feature_names = tf_vectorizer.get_feature_names() >>>>> print_top_words(lda, tf_feature_names, 5) >>>>> >>>>> Is there a problem in how I set up the LatentDirichletAllocation >>>>> instance >>>>> or pass the data? I tried out different parameter settings, but none of >>>>> them provided good results for that corpus. I also tried out >>>>> alternative >>>>> implementations (like the lda package [2]) and those were able to find >>>>> reasonable topics. 
>>>>> Best, >>>>> Markus >>>>> >>>>> >>>>> [1] >>>>> http://scikit-learn.org/stable/auto_examples/applications/plot_topics_extraction_with_nmf_lda.html#sphx-glr-auto-examples-applications-plot-topics-extraction-with-nmf-lda-py >>>>> >>>>> [2] http://pythonhosted.org/lda/ >>> _______________________________________________ >>> scikit-learn mailing list >>> scikit-learn at python.org >>> https://mail.python.org/mailman/listinfo/scikit-learn >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn From markus.konrad at wzb.eu Wed Sep 20 03:18:40 2017 From: markus.konrad at wzb.eu (Markus Konrad) Date: Wed, 20 Sep 2017 09:18:40 +0200 Subject: [scikit-learn] LatentDirichletAllocation failing to find topics in NLTK Gutenberg corpus? In-Reply-To: References: <8b6f7eda-0152-41bd-d44d-16d870ffcbe6@gmail.com> Message-ID: <4d27076b-52c7-b09f-1738-15fd0e062a58@wzb.eu> I tried it with 12 topics (that's the number that minimized the log likelihood) and there were also some very general topics. But the Gibbs sampling didn't extract "empty topics" (those with all weights equal to `topic_word_prior`) as opposed to sklearn's implementation. This is what puzzled me. It isn't actually "little" data. The documents themselves are quite big. But I think that this is where my thinking went wrong initially. I thought that if 18 big documents cover a certain set of topics, then if I split these documents into more, but smaller documents, a similar set of topics should be discovered. But you're right, the latter contains more information. Taken to an extreme: If I had only 1 document, it wouldn't be possible to find the topics in there with LDA.
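As a minimal sketch of the paragraph-splitting approach discussed in this thread (the chunking helper below is illustrative, not part of the original scripts; it assumes paragraphs are separated by blank lines):

```python
def split_into_subdocs(raw_text, paras_per_doc=5):
    """Split one raw text into pseudo-documents of `paras_per_doc` paragraphs."""
    # Paragraph boundaries are assumed to be blank lines.
    paragraphs = [p for p in raw_text.split("\n\n") if p.strip()]
    return ["\n\n".join(paragraphs[i:i + paras_per_doc])
            for i in range(0, len(paragraphs), paras_per_doc)]

# Usage with the Gutenberg corpus from the original script would look like:
# data_samples = [nltk.corpus.gutenberg.raw(f_id)
#                 for f_id in nltk.corpus.gutenberg.fileids()]
# sub_docs = [chunk for doc in data_samples
#             for chunk in split_into_subdocs(doc)]
# tf = tf_vectorizer.fit_transform(sub_docs)
```

This turns the 18 big documents into several thousand small ones before vectorizing, which is the regime where the variational inference in this thread behaved well.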
Best, Markus On 09/19/2017 06:07 PM, Andreas Mueller wrote: > I'm actually surprised the gibbs sampling gave useful results with so > little data. > And splitting the documents results in very different data. It has a lot > more information. > How many topics did you use? > > Also: PR for docs welcome! > > On 09/19/2017 04:26 AM, Markus Konrad wrote: >> This is indeed interesting. I didn't know that there are so big >> differences between these approaches. I split the 18 documents into >> sub-documents of 5 paragraphs each, so that I got around 10k of these >> sub-documents. Now, scikit-learn and gensim deliver much better results, >> quite similar to those from a Gibbs sampling based implementation. So it >> was basically the same data, just split in a different way. >> >> I think the disadvantages/limits of the Variational Bayes approach >> should be mentioned in the documentation. >> >> Best, >> Markus >> >> >> >> On 09/18/2017 06:59 PM, Andreas Mueller wrote: >>> For very few documents, Gibbs sampling is likely to work better - or >>> rather, Gibbs sampling usually works >>> better given enough runtime, and for so few documents, runtime is not an >>> issue. >>> The length of the documents don't matter, only the size of the >>> vocabulary. >>> Also, hyper parameter choices might need to be different for Gibbs >>> sampling vs variational inference. >>> >>> On 09/18/2017 12:26 PM, Markus Konrad wrote: >>>> Hi Chyi-Kwei, >>>> >>>> thanks for digging into this. I made similar observations with Gensim >>>> when using only a small number of (big) documents. Gensim also uses the >>>> Online Variational Bayes approach (Hoffman et al.). So could it be that >>>> the Hoffman et al. method is problematic in such scenarios? I found >>>> that >>>> Gibbs sampling based implementations provide much more informative >>>> topics in this case. 
>>>> >>>> If this was the case, then if I'd slice the documents in some way (say >>>> every N paragraphs become a "document") then I should get better >>>> results >>>> with scikit-learn and Gensim, right? I think I'll try this out >>>> tomorrow. >>>> >>>> Best, >>>> Markus >>>> >>>> >>>> >>>>> Date: Sun, 17 Sep 2017 23:52:51 +0000 >>>>> From: chyi-kwei yau >>>>> To: Scikit-learn mailing list >>>>> Subject: Re: [scikit-learn] LatentDirichletAllocation failing to find >>>>> ?????topics in NLTK Gutenberg corpus? >>>>> Message-ID: >>>>> ????? >>>>> >>>>> Content-Type: text/plain; charset="utf-8" >>>>> >>>>> Hi Markus, >>>>> >>>>> I tried your code and find the issue might be there are only 18 docs >>>>> in the Gutenberg >>>>> corpus. >>>>> if you print out transformed doc topic distribution, you will see a >>>>> lot of >>>>> topics are not used. >>>>> And since there is no words assigned to those topics, the weights >>>>> will be >>>>> equal to`topic_word_prior` parameter. >>>>> >>>>> You can print out the transformed doc topic distributions like this: >>>>> ------------- >>>>>>>> doc_distr = lda.fit_transform(tf) >>>>>>>> for d in doc_distr: >>>>> ...???? print np.where(d > 0.001)[0] >>>>> ... >>>>> [17 27] >>>>> [17 27] >>>>> [17 27 28] >>>>> [14] >>>>> [ 2? 4 28] >>>>> [ 2? 4 15 21 27 28] >>>>> [1] >>>>> [ 1? 2 17 21 27 28] >>>>> [ 2 15 17 22 28] >>>>> [ 2 17 21 22 27 28] >>>>> [ 2 15 17 28] >>>>> [ 2 17 21 27 28] >>>>> [ 2 14 15 17 21 22 27 28] >>>>> [15 22] >>>>> [ 8 11] >>>>> [8] >>>>> [ 8 24] >>>>> [ 2 14 15 22] >>>>> >>>>> and my full test scripts are here: >>>>> https://gist.github.com/chyikwei/1707b59e009d381e1ce1e7a38f9c7826 >>>>> >>>>> Best, >>>>> Chyi-Kwei >>>>> >>>>> >>>>> On Thu, Sep 14, 2017 at 7:26 AM Markus Konrad >>>>> wrote: >>>>> >>>>>> Hi there, >>>>>> >>>>>> I'm trying out sklearn's latent Dirichlet allocation >>>>>> implementation for >>>>>> topic modeling. 
The code from the official example [1] works just >>>>>> fine and >>>>>> the extracted topics look reasonable. However, when I try other >>>>>> corpora, >>>>>> for example the Gutenberg corpus from NLTK, most of the extracted >>>>>> topics >>>>>> are garbage. See this example output, when trying to get 30 topics: >>>>>> >>>>>> Topic #0: zoological (0.01) fathoms (0.01) fatigued (0.01) fatigues >>>>>> (0.01) >>>>>> fatiguing (0.01) >>>>>> Topic #1: mr (1081.61) emma (866.01) miss (506.94) mrs (445.56) jane >>>>>> (301.83) >>>>>> Topic #2: zoological (0.01) fathoms (0.01) fatigued (0.01) fatigues >>>>>> (0.01) >>>>>> fatiguing (0.01) >>>>>> Topic #3: thee (82.64) thou (70.0) thy (66.66) father (56.45) mother >>>>>> (55.27) >>>>>> Topic #4: anne (498.74) captain (303.01) lady (173.96) mr (172.07) >>>>>> charles >>>>>> (166.21) >>>>>> Topic #5: zoological (0.01) fathoms (0.01) fatigued (0.01) fatigues >>>>>> (0.01) >>>>>> fatiguing (0.01) >>>>>> Topic #6: zoological (0.01) fathoms (0.01) fatigued (0.01) fatigues >>>>>> (0.01) >>>>>> fatiguing (0.01) >>>>>> Topic #7: zoological (0.01) fathoms (0.01) fatigued (0.01) fatigues >>>>>> (0.01) >>>>>> fatiguing (0.01) >>>>>> ... >>>>>> >>>>>> Many topics tend to have the same weights, all equal to the >>>>>> `topic_word_prior` parameter. >>>>>> >>>>>> This is my script: >>>>>> >>>>>> import nltk >>>>>> from sklearn.feature_extraction.text import CountVectorizer >>>>>> from sklearn.decomposition import LatentDirichletAllocation >>>>>> >>>>>> def print_top_words(model, feature_names, n_top_words): >>>>>> ????? for topic_idx, topic in enumerate(model.components_): >>>>>> ????????? message = "Topic #%d: " % topic_idx >>>>>> ????????? message += " ".join([feature_names[i] + " (" + >>>>>> str(round(topic[i], >>>>>> 2)) + ")" >>>>>> ?????????????????????????????? for i in >>>>>> topic.argsort()[:-n_top_words - >>>>>> 1:-1]]) >>>>>> ????????? 
print(message) >>>>>> >>>>>> >>>>>> data_samples = [nltk.corpus.gutenberg.raw(f_id) >>>>>> ???????????????? for f_id in nltk.corpus.gutenberg.fileids()] >>>>>> >>>>>> tf_vectorizer = CountVectorizer(max_df=0.95, min_df=2, >>>>>> ????????????????????????????????? stop_words='english') >>>>>> tf = tf_vectorizer.fit_transform(data_samples) >>>>>> >>>>>> lda = LatentDirichletAllocation(n_components=30, >>>>>> ????????????????????????????????? learning_method='batch', >>>>>> ????????????????????????????????? n_jobs=-1,? # all CPUs >>>>>> ????????????????????????????????? verbose=1, >>>>>> ????????????????????????????????? evaluate_every=10, >>>>>> ????????????????????????????????? max_iter=1000, >>>>>> ????????????????????????????????? doc_topic_prior=0.1, >>>>>> ????????????????????????????????? topic_word_prior=0.01, >>>>>> ????????????????????????????????? random_state=1) >>>>>> >>>>>> lda.fit(tf) >>>>>> tf_feature_names = tf_vectorizer.get_feature_names() >>>>>> print_top_words(lda, tf_feature_names, 5) >>>>>> >>>>>> Is there a problem in how I set up the LatentDirichletAllocation >>>>>> instance >>>>>> or pass the data? I tried out different parameter settings, but >>>>>> none of >>>>>> them provided good results for that corpus. I also tried out >>>>>> alternative >>>>>> implementations (like the lda package [2]) and those were able to >>>>>> find >>>>>> reasonable topics. 
>>>>>> >>>>>> Best, >>>>>> Markus >>>>>> >>>>>> >>>>>> [1] >>>>>> http://scikit-learn.org/stable/auto_examples/applications/plot_topics_extraction_with_nmf_lda.html#sphx-glr-auto-examples-applications-plot-topics-extraction-with-nmf-lda-py >>>>>> >>>>>> >>>>>> [2] http://pythonhosted.org/lda/ >>>> _______________________________________________ >>>> scikit-learn mailing list >>>> scikit-learn at python.org >>>> https://mail.python.org/mailman/listinfo/scikit-learn >>> _______________________________________________ >>> scikit-learn mailing list >>> scikit-learn at python.org >>> https://mail.python.org/mailman/listinfo/scikit-learn >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn From markus.konrad at wzb.eu Wed Sep 20 06:20:55 2017 From: markus.konrad at wzb.eu (Markus Konrad) Date: Wed, 20 Sep 2017 12:20:55 +0200 Subject: [scikit-learn] LatentDirichletAllocation failing to find topics in NLTK Gutenberg corpus? In-Reply-To: <4d27076b-52c7-b09f-1738-15fd0e062a58@wzb.eu> References: <8b6f7eda-0152-41bd-d44d-16d870ffcbe6@gmail.com> <4d27076b-52c7-b09f-1738-15fd0e062a58@wzb.eu> Message-ID: Sorry, I meant of course "the number that *maximized* the log likelihood" in the first sentence... On 09/20/2017 09:18 AM, Markus Konrad wrote: > I tried it with 12 topics (that's the number that minimized the log > likelihood) and there were also some very general topics. But the Gibbs > sampling didn't extract "empty topics" (those with all weights equal to > `topic_word_prior`) as opposed to sklearn's implementation. This is what > puzzled me. > > It isn't actually "little" data. The documents themselves are quite big. > But I think that this is where my thinking went wrong initially. 
I > thought that if 18 big documents cover a certain set of topics, then if > I split these documents into more, but smaller documents, a similar set > of topics should be discovered. But you're right, the latter contains > more information. Taken to an extreme: If I had only 1 document, it > wouldn't be possible to find the topics in there with LDA. > > Best, > Markus > > > > On 09/19/2017 06:07 PM, Andreas Mueller wrote: >> I'm actually surprised the gibbs sampling gave useful results with so >> little data. >> And splitting the documents results in very different data. It has a lot >> more information. >> How many topics did you use? >> >> Also: PR for docs welcome! >> >> On 09/19/2017 04:26 AM, Markus Konrad wrote: >>> This is indeed interesting. I didn't know that there are so big >>> differences between these approaches. I split the 18 documents into >>> sub-documents of 5 paragraphs each, so that I got around 10k of these >>> sub-documents. Now, scikit-learn and gensim deliver much better results, >>> quite similar to those from a Gibbs sampling based implementation. So it >>> was basically the same data, just split in a different way. >>> >>> I think the disadvantages/limits of the Variational Bayes approach >>> should be mentioned in the documentation. >>> >>> Best, >>> Markus >>> >>> >>> >>> On 09/18/2017 06:59 PM, Andreas Mueller wrote: >>>> For very few documents, Gibbs sampling is likely to work better - or >>>> rather, Gibbs sampling usually works >>>> better given enough runtime, and for so few documents, runtime is not an >>>> issue. >>>> The length of the documents don't matter, only the size of the >>>> vocabulary. >>>> Also, hyper parameter choices might need to be different for Gibbs >>>> sampling vs variational inference. >>>> >>>> On 09/18/2017 12:26 PM, Markus Konrad wrote: >>>>> Hi Chyi-Kwei, >>>>> >>>>> thanks for digging into this. I made similar observations with Gensim >>>>> when using only a small number of (big) documents. 
Gensim also uses the >>>>> Online Variational Bayes approach (Hoffman et al.). So could it be that >>>>> the Hoffman et al. method is problematic in such scenarios? I found >>>>> that >>>>> Gibbs sampling based implementations provide much more informative >>>>> topics in this case. >>>>> >>>>> If this was the case, then if I'd slice the documents in some way (say >>>>> every N paragraphs become a "document") then I should get better >>>>> results >>>>> with scikit-learn and Gensim, right? I think I'll try this out >>>>> tomorrow. >>>>> >>>>> Best, >>>>> Markus >>>>> >>>>> >>>>> >>>>>> Date: Sun, 17 Sep 2017 23:52:51 +0000 >>>>>> From: chyi-kwei yau >>>>>> To: Scikit-learn mailing list >>>>>> Subject: Re: [scikit-learn] LatentDirichletAllocation failing to find >>>>>> ?????topics in NLTK Gutenberg corpus? >>>>>> Message-ID: >>>>>> ????? >>>>>> >>>>>> Content-Type: text/plain; charset="utf-8" >>>>>> >>>>>> Hi Markus, >>>>>> >>>>>> I tried your code and find the issue might be there are only 18 docs >>>>>> in the Gutenberg >>>>>> corpus. >>>>>> if you print out transformed doc topic distribution, you will see a >>>>>> lot of >>>>>> topics are not used. >>>>>> And since there is no words assigned to those topics, the weights >>>>>> will be >>>>>> equal to`topic_word_prior` parameter. >>>>>> >>>>>> You can print out the transformed doc topic distributions like this: >>>>>> ------------- >>>>>>>>> doc_distr = lda.fit_transform(tf) >>>>>>>>> for d in doc_distr: >>>>>> ...???? print np.where(d > 0.001)[0] >>>>>> ... >>>>>> [17 27] >>>>>> [17 27] >>>>>> [17 27 28] >>>>>> [14] >>>>>> [ 2? 4 28] >>>>>> [ 2? 4 15 21 27 28] >>>>>> [1] >>>>>> [ 1? 
2 17 21 27 28] >>>>>> [ 2 15 17 22 28] >>>>>> [ 2 17 21 22 27 28] >>>>>> [ 2 15 17 28] >>>>>> [ 2 17 21 27 28] >>>>>> [ 2 14 15 17 21 22 27 28] >>>>>> [15 22] >>>>>> [ 8 11] >>>>>> [8] >>>>>> [ 8 24] >>>>>> [ 2 14 15 22] >>>>>> >>>>>> and my full test scripts are here: >>>>>> https://gist.github.com/chyikwei/1707b59e009d381e1ce1e7a38f9c7826 >>>>>> >>>>>> Best, >>>>>> Chyi-Kwei >>>>>> >>>>>> >>>>>> On Thu, Sep 14, 2017 at 7:26 AM Markus Konrad >>>>>> wrote: >>>>>> >>>>>>> Hi there, >>>>>>> >>>>>>> I'm trying out sklearn's latent Dirichlet allocation >>>>>>> implementation for >>>>>>> topic modeling. The code from the official example [1] works just >>>>>>> fine and >>>>>>> the extracted topics look reasonable. However, when I try other >>>>>>> corpora, >>>>>>> for example the Gutenberg corpus from NLTK, most of the extracted >>>>>>> topics >>>>>>> are garbage. See this example output, when trying to get 30 topics: >>>>>>> >>>>>>> Topic #0: zoological (0.01) fathoms (0.01) fatigued (0.01) fatigues >>>>>>> (0.01) >>>>>>> fatiguing (0.01) >>>>>>> Topic #1: mr (1081.61) emma (866.01) miss (506.94) mrs (445.56) jane >>>>>>> (301.83) >>>>>>> Topic #2: zoological (0.01) fathoms (0.01) fatigued (0.01) fatigues >>>>>>> (0.01) >>>>>>> fatiguing (0.01) >>>>>>> Topic #3: thee (82.64) thou (70.0) thy (66.66) father (56.45) mother >>>>>>> (55.27) >>>>>>> Topic #4: anne (498.74) captain (303.01) lady (173.96) mr (172.07) >>>>>>> charles >>>>>>> (166.21) >>>>>>> Topic #5: zoological (0.01) fathoms (0.01) fatigued (0.01) fatigues >>>>>>> (0.01) >>>>>>> fatiguing (0.01) >>>>>>> Topic #6: zoological (0.01) fathoms (0.01) fatigued (0.01) fatigues >>>>>>> (0.01) >>>>>>> fatiguing (0.01) >>>>>>> Topic #7: zoological (0.01) fathoms (0.01) fatigued (0.01) fatigues >>>>>>> (0.01) >>>>>>> fatiguing (0.01) >>>>>>> ... >>>>>>> >>>>>>> Many topics tend to have the same weights, all equal to the >>>>>>> `topic_word_prior` parameter. 
>>>>>>> >>>>>>> This is my script: >>>>>>> >>>>>>> import nltk >>>>>>> from sklearn.feature_extraction.text import CountVectorizer >>>>>>> from sklearn.decomposition import LatentDirichletAllocation >>>>>>> >>>>>>> def print_top_words(model, feature_names, n_top_words): >>>>>>> ????? for topic_idx, topic in enumerate(model.components_): >>>>>>> ????????? message = "Topic #%d: " % topic_idx >>>>>>> ????????? message += " ".join([feature_names[i] + " (" + >>>>>>> str(round(topic[i], >>>>>>> 2)) + ")" >>>>>>> ?????????????????????????????? for i in >>>>>>> topic.argsort()[:-n_top_words - >>>>>>> 1:-1]]) >>>>>>> ????????? print(message) >>>>>>> >>>>>>> >>>>>>> data_samples = [nltk.corpus.gutenberg.raw(f_id) >>>>>>> ???????????????? for f_id in nltk.corpus.gutenberg.fileids()] >>>>>>> >>>>>>> tf_vectorizer = CountVectorizer(max_df=0.95, min_df=2, >>>>>>> ????????????????????????????????? stop_words='english') >>>>>>> tf = tf_vectorizer.fit_transform(data_samples) >>>>>>> >>>>>>> lda = LatentDirichletAllocation(n_components=30, >>>>>>> ????????????????????????????????? learning_method='batch', >>>>>>> ????????????????????????????????? n_jobs=-1,? # all CPUs >>>>>>> ????????????????????????????????? verbose=1, >>>>>>> ????????????????????????????????? evaluate_every=10, >>>>>>> ????????????????????????????????? max_iter=1000, >>>>>>> ????????????????????????????????? doc_topic_prior=0.1, >>>>>>> ????????????????????????????????? topic_word_prior=0.01, >>>>>>> ????????????????????????????????? random_state=1) >>>>>>> >>>>>>> lda.fit(tf) >>>>>>> tf_feature_names = tf_vectorizer.get_feature_names() >>>>>>> print_top_words(lda, tf_feature_names, 5) >>>>>>> >>>>>>> Is there a problem in how I set up the LatentDirichletAllocation >>>>>>> instance >>>>>>> or pass the data? I tried out different parameter settings, but >>>>>>> none of >>>>>>> them provided good results for that corpus. 
I also tried out >>>>>>> alternative >>>>>>> implementations (like the lda package [2]) and those were able to >>>>>>> find >>>>>>> reasonable topics. >>>>>>> >>>>>>> Best, >>>>>>> Markus >>>>>>> >>>>>>> >>>>>>> [1] >>>>>>> http://scikit-learn.org/stable/auto_examples/applications/plot_topics_extraction_with_nmf_lda.html#sphx-glr-auto-examples-applications-plot-topics-extraction-with-nmf-lda-py >>>>>>> >>>>>>> >>>>>>> [2] http://pythonhosted.org/lda/ >>>>> _______________________________________________ >>>>> scikit-learn mailing list >>>>> scikit-learn at python.org >>>>> https://mail.python.org/mailman/listinfo/scikit-learn >>>> _______________________________________________ >>>> scikit-learn mailing list >>>> scikit-learn at python.org >>>> https://mail.python.org/mailman/listinfo/scikit-learn >>> _______________________________________________ >>> scikit-learn mailing list >>> scikit-learn at python.org >>> https://mail.python.org/mailman/listinfo/scikit-learn >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn -- -- Markus Konrad - DV / Data Science - fon: +49 30 25491 555 fax: +49 30 25491 558 mail: markus.konrad at wzb.eu WZB Data Science Blog: https://datascience.blog.wzb.eu/ Raum D 005 WZB - Wissenschaftszentrum Berlin für Sozialforschung Reichpietschufer 50 D-10785 Berlin From s.atasever at gmail.com Wed Sep 20 07:40:32 2017 From: s.atasever at gmail.com (Sema Atasever) Date: Wed, 20 Sep 2017 14:40:32 +0300 Subject: [scikit-learn] Accessing Clustering Feature Tree in Birch In-Reply-To: References: Message-ID: I need this information to use it in a scientific study and I think that a function interface would make this easier. Thank you for your answer. On Sat, Sep 16, 2017 at 1:53 PM, Joel Nothman wrote: > There is no such thing as "the data samples in this cluster".
The point of > Birch being online is that it loses any reference to the individual samples > that contributed to each node, but stores some statistics on their basis. > Roman Yurchak has, however, offered a PR where, for the non-online case, > storage of the indices contributing to each node can be optionally turned > on: https://github.com/scikit-learn/scikit-learn/pull/8808 > > As for finding what is contained under any particular node, traversing the > tree is a fairly basic task from a computer science perspective. Before we > were to support something to make this much easier, I think we'd need to be > clear on what kinds of use case we were supporting. What do you hope to do > with this information, and what would a function interface look like that > would make this much easier? > > Decimals aren't a practical option as the branching factor may be greater > than 10; it is a hard structure to inspect, and susceptible to > computational imprecision. Better off with a list of tuples, but what for > that is not easy enough to do now? > > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From hzmao at hotmail.com Thu Sep 21 16:38:33 2017 From: hzmao at hotmail.com (hanzi mao) Date: Thu, 21 Sep 2017 20:38:33 +0000 Subject: [scikit-learn] Decision Tree Regressor - DepthFirstTreeBuilder vs BestFirstTreeBuilder Message-ID: Hi, I am reading the source code of the Decision Tree Regressor in sklearn. To build a tree, there are two fashions: depth first and best first. The best-first fashion is adopted only when the user sets max_leaf_nodes. Otherwise, the tree will be built using the DepthFirstTreeBuilder. My questions are: 1. Are there any practical considerations when to use depth-first or best-first?
Does the depth-first fashion have an overwhelming advantage / popularity compared with the best-first one which makes it the default choice? 2. I am kind of confused about why an optional parameter, max_leaf_nodes, decides whether to use BestFirstTreeBuilder or not. I am wondering if there were some considerations when you decided to develop it like this. Thanks! Best, Hanna -------------- next part -------------- An HTML attachment was scrubbed... URL: From jmschreiber91 at gmail.com Fri Sep 22 13:02:54 2017 From: jmschreiber91 at gmail.com (Jacob Schreiber) Date: Fri, 22 Sep 2017 10:02:54 -0700 Subject: [scikit-learn] Decision Tree Regressor - DepthFirstTreeBuilder vs BestFirstTreeBuilder In-Reply-To: References: Message-ID: Hi Hanna, Thanks for the questions! 1) Best first tends to produce unbalanced but sparser trees, and frequently produces more generalizable models by only capturing the most important interactions. Unbalanced isn't necessarily bad either. You can imagine that in some parts of the tree there are complex split rules that are important to learn, but in other parts of the tree the additional splits only improve purity a tiny bit and risk overfitting (and thus being less generalizable). 2) If you let best first and depth first run until purity is reached, they will produce identical trees. The only difference is the ordering of the nodes as they get added to the tree. Best first will add nodes to the tree ordered by their increase in purity, and depth first adds nodes essentially in the order one would do a depth-first search. If one were to stop a best-first build early, they would get a tree where the important interactions are captured first, whereas if one were to stop a depth-first build early, they would get a really good split of one or maybe a few areas of the dataset (generally speaking).
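A minimal sketch (not from the thread) of the switch described above: leaving max_leaf_nodes unset grows the tree depth-first until purity, while setting it caps the tree via best-first growth. The dataset choice here is purely for illustration:

```python
from sklearn.datasets import load_diabetes
from sklearn.tree import DecisionTreeRegressor

diabetes = load_diabetes()
X, y = diabetes.data, diabetes.target

# No max_leaf_nodes -> DepthFirstTreeBuilder: grown until leaves are pure
# (or min_samples limits are hit), usually producing a large tree.
depth_first = DecisionTreeRegressor(random_state=0).fit(X, y)

# max_leaf_nodes set -> BestFirstTreeBuilder: nodes are expanded in order
# of impurity decrease until the leaf budget is exhausted.
best_first = DecisionTreeRegressor(max_leaf_nodes=8, random_state=0).fit(X, y)

def n_leaves(model):
    # A leaf is a node without children in the underlying Tree arrays.
    return int((model.tree_.children_left == -1).sum())

print(n_leaves(depth_first), n_leaves(best_first))
```

The depth-first tree ends up with far more leaves, while the best-first tree stays within the requested budget, containing only the highest-impurity-decrease splits.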
The reason max_leaf_nodes decides if BestFirstSplitter will be used or not is because it doesn't make sense to limit a depth first build by the number of nodes, and it doesn't make sense to run BestFirstSplitter without limiting the number of nodes in the tree. Let me know if you have any further questions! Jacob On Thu, Sep 21, 2017 at 1:38 PM, hanzi mao wrote: > Hi, > > > I am reading the source code of the Decision Tree Regressor in sklearn. To > build a tree, there are two fashions: depth first and best first. Best > first fashion is adopted only when user set max_leaf_nodes. Otherwise, > the tree will be built using the DepthFirstTreeBuilder. My questions are: > > > > 1. Are there any practical considerations when to use depth-first or > best-first? Dose the depth-first fashion has a overwhelming advantage / > popularity compared with the best-first one which makes it a default > choice? > 2. I am kind of confused why using a optional parameter max_leaf_nodes > to decide whether to use BestFirstTreeBuilder or not. I am wondering if > there are some considerations when you decide to develop like this. > > > Thanks! > > Best, > Hanna > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From hzmao at hotmail.com Fri Sep 22 15:23:03 2017 From: hzmao at hotmail.com (hanzi mao) Date: Fri, 22 Sep 2017 19:23:03 +0000 Subject: [scikit-learn] Decision Tree Regressor - DepthFirstTreeBuilder vs BestFirstTreeBuilder In-Reply-To: References: , Message-ID: Thanks Jacob! You explain the ideas behind the two builders very well! 
Best, Hanna ________________________________ From: scikit-learn on behalf of Jacob Schreiber Sent: Friday, September 22, 2017 1:02:54 PM To: Scikit-learn mailing list Subject: Re: [scikit-learn] Decision Tree Regressor - DepthFirstTreeBuilder vs BestFirstTreeBuilder [quoted reply trimmed] _______________________________________________ scikit-learn mailing list scikit-learn at python.org https://mail.python.org/mailman/listinfo/scikit-learn -------------- next part -------------- An HTML attachment was scrubbed... URL: From tevang3 at gmail.com Sun Sep 24 16:35:11 2017 From: tevang3 at gmail.com (Thomas Evangelidis) Date: Sun, 24 Sep 2017 22:35:11 +0200 Subject: [scikit-learn] batch_size for small training sets Message-ID: Greetings, I train MLPRegressors using small datasets, usually with 10-50 observations. The default batch_size=min(200, n_samples) for the adam optimizer, and because my n_samples is always < 200, it is eventually batch_size=n_samples. According to the theory, stochastic gradient-based optimizers like adam perform better in the small-batch regime. Considering the above, what would be a good batch_size value in my case (e.g. 4)? Is there any rule of thumb for selecting the batch_size when n_samples is small, or must the choice be based on trial and error? -- ====================================================================== Dr Thomas Evangelidis Post-doctoral Researcher CEITEC - Central European Institute of Technology Masaryk University Kamenice 5/A35/2S049, 62500 Brno, Czech Republic email: tevang at pharm.uoa.gr tevang3 at gmail.com website: https://sites.google.com/site/thomasevangelidishomepage/ -------------- next part -------------- An HTML attachment was scrubbed...
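A minimal scaffold for this kind of experiment might look as follows; the dataset and the candidate batch sizes (4, 8, 32) are made up for illustration, not recommendations:

```python
# Compare a few batch_size values for MLPRegressor on a tiny dataset
# (n_samples < 200, as described above) via cross-validated R^2 scores.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPRegressor

rng = np.random.RandomState(0)
X = rng.rand(40, 5)                       # ~40 observations, like the small sets above
y = X @ rng.rand(5) + 0.1 * rng.rand(40)  # noisy linear target

for batch_size in (4, 8, 32):             # 32 is roughly a full batch for a 5-fold training split
    mlp = MLPRegressor(solver='adam', batch_size=batch_size,
                       max_iter=2000, random_state=0)
    score = cross_val_score(mlp, X, y, cv=5).mean()
    print(batch_size, round(score, 3))
```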
URL: From se.raschka at gmail.com Sun Sep 24 16:47:05 2017 From: se.raschka at gmail.com (Sebastian Raschka) Date: Sun, 24 Sep 2017 16:47:05 -0400 Subject: [scikit-learn] batch_size for small training sets In-Reply-To: References: Message-ID: Small batch sizes are typically used to speed up the training (more iterations) and to avoid the issue that training sets usually don't fit into memory. Also, the additional noise from the stochastic approach may be helpful for escaping local minima and/or for generalization performance (e.g., as discussed in the recent paper where the authors compared SGD to other optimizers). In any case, since batch size is effectively a hyperparameter, I would just experiment with a few values and compare. Also, since you have a small dataset, I would maybe also try just going with batch gradient descent (i.e., batch size = n training samples). Best, Sebastian Sent from my iPhone > On Sep 24, 2017, at 4:35 PM, Thomas Evangelidis wrote: > > [quoted message trimmed] > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn -------------- next part -------------- An HTML attachment was scrubbed... URL: From tevang3 at gmail.com Tue Sep 26 12:10:39 2017 From: tevang3 at gmail.com (Thomas Evangelidis) Date: Tue, 26 Sep 2017 18:10:39 +0200 Subject: [scikit-learn] anti-correlated predictions by SVR Message-ID: Greetings, I don't know if anyone encountered this before, but sometimes I get anti-correlated predictions from the SVR that I am training. Namely, the Pearson's R and Kendall's tau are negative when I compare the predictions on the external test set with the true values. However, the SVR predictions on the training set have positive correlations with the experimental values, and hence I can't think of a way to know in advance whether the trained SVR will produce anti-correlated predictions in order to change their sign and avoid the disaster. Here is an example of what I mean: Training set predictions: R=0.452422, tau=0.333333 External test set predictions: R=-0.537420, tau=-0.300000 Obviously, in a real case scenario where I wouldn't have the external test set, I would have used the worst observations instead of the best ones. Has anybody any idea about how I could prevent this?
thanks in advance Thomas -- ====================================================================== Dr Thomas Evangelidis Post-doctoral Researcher CEITEC - Central European Institute of Technology Masaryk University Kamenice 5/A35/2S049, 62500 Brno, Czech Republic email: tevang at pharm.uoa.gr tevang3 at gmail.com website: https://sites.google.com/site/thomasevangelidishomepage/ -------------- next part -------------- An HTML attachment was scrubbed... URL: From gael.varoquaux at normalesup.org Tue Sep 26 12:21:37 2017 From: gael.varoquaux at normalesup.org (Gael Varoquaux) Date: Tue, 26 Sep 2017 18:21:37 +0200 Subject: [scikit-learn] anti-correlated predictions by SVR In-Reply-To: References: Message-ID: <20170926162137.GC2886336@phare.normalesup.org> Hypothesis: you have a very small dataset, and when you leave out data, you create a distribution shift between the train and the test. A simplified example: 20 samples, 10 of class a, 10 of class b. A leave-one-out cross-validation will create a training set of 10 samples of one class and 9 samples of the other, and the test set is composed of the class that is in the minority on the train set. G On Tue, Sep 26, 2017 at 06:10:39PM +0200, Thomas Evangelidis wrote: > [quoted message trimmed] -- Gael Varoquaux Researcher, INRIA Parietal NeuroSpin/CEA Saclay, Bat 145, 91191 Gif-sur-Yvette France Phone: ++ 33-1-69-08-79-68 http://gael-varoquaux.info http://twitter.com/GaelVaroquaux From tevang3 at gmail.com Tue Sep 26 12:48:56 2017 From: tevang3 at gmail.com (Thomas Evangelidis) Date: Tue, 26 Sep 2017 18:48:56 +0200 Subject: [scikit-learn] anti-correlated predictions by SVR In-Reply-To: <20170926162137.GC2886336@phare.normalesup.org> References: <20170926162137.GC2886336@phare.normalesup.org> Message-ID: I have very small training sets (10-50 observations). Currently, I am working with 16 observations for training and 25 for validation (external test set). And I am doing regression, not classification (hence the SVR instead of SVC). On 26 September 2017 at 18:21, Gael Varoquaux wrote: > [quoted message trimmed] > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -- ====================================================================== Dr Thomas Evangelidis Post-doctoral Researcher CEITEC - Central European Institute of Technology Masaryk University Kamenice 5/A35/2S049, 62500 Brno, Czech Republic email: tevang at pharm.uoa.gr tevang3 at gmail.com website: https://sites.google.com/site/thomasevangelidishomepage/ -------------- next part -------------- An HTML attachment was scrubbed... URL: From gael.varoquaux at normalesup.org Tue Sep 26 12:56:12 2017 From: gael.varoquaux at normalesup.org (Gael Varoquaux) Date: Tue, 26 Sep 2017 18:56:12 +0200 Subject: [scikit-learn] anti-correlated predictions by SVR In-Reply-To: References: <20170926162137.GC2886336@phare.normalesup.org> Message-ID: <9a6505b4-2780-4c4c-8208-a81cee6c1cb7@normalesup.org> I took my example in classification for didactic purposes.
My hypothesis still holds: the splitting of the data creates anti-correlations between train and test (a depletion effect). Basically, you shouldn't work with datasets that small. Gaël "Sent from my phone, please excuse typos and briefness" On Sep 26, 2017, 18:51, at 18:51, Thomas Evangelidis wrote: >[quoted message trimmed] > >_______________________________________________ >scikit-learn mailing list >scikit-learn at python.org >https://mail.python.org/mailman/listinfo/scikit-learn -------------- next part -------------- An HTML attachment was scrubbed... URL: From se.raschka at gmail.com Tue Sep 26 12:58:00 2017 From: se.raschka at gmail.com (Sebastian Raschka) Date: Tue, 26 Sep 2017 12:58:00 -0400 Subject: [scikit-learn] anti-correlated predictions by SVR In-Reply-To: References: <20170926162137.GC2886336@phare.normalesup.org> Message-ID: I'd agree with Gael that a potential explanation could be the distribution shift upon splitting (usually the smaller the dataset, the more of an issue this is).
As potential solutions/workarounds, you could try a) stratified sampling for regression, if you'd like to stick with the 2-way holdout method b) leave-one-out cross-validation for evaluation (your model will likely benefit from the additional training samples) c) the leave-one-out bootstrap (at each round, draw a bootstrap sample from the dataset for training, then use the points not in the training sample for testing) Best, Sebastian > On Sep 26, 2017, at 12:48 PM, Thomas Evangelidis wrote: > > [quoted message trimmed] > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn From apurva3000 at gmail.com Wed Sep 27 07:53:08 2017 From: apurva3000 at gmail.com (Apurva Nandan) Date: Wed, 27 Sep 2017 14:53:08 +0300 Subject: [scikit-learn] TF-IDF Message-ID: Hello, Could anybody tell me the difference between using augmented frequency (which is used for weighting term frequencies to eliminate the bias towards larger documents) and cosine normalization (the l2 norm, which scikit-learn uses for TfidfTransformer)? Augmented frequency is given by the following equation. It tries to divide the natural term frequency by the maximum frequency of any term in the document.
aug_tf(t, d) = 0.5 + 0.5 * f(t, d) / max({f(t', d) : t' in d}) Do they both do the same thing when it comes to eliminating bias towards larger documents? I suppose scikit-learn uses the natural term frequency, and cosine normalization is enabled with norm='l2'. Any help would be appreciated! - Apurva -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image.png Type: image/png Size: 3602 bytes Desc: not available URL:
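For what it's worth, augmented frequency is not built into TfidfTransformer, but it is easy to compute by hand next to the l2-normalized output for comparison (the toy documents below are made up):

```python
# Compare scikit-learn's l2 (cosine) normalization with a hand-computed
# augmented frequency. The two toy documents differ in length on purpose.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

docs = ["the cat sat", "the cat sat on the mat the mat"]
counts = CountVectorizer().fit_transform(docs).toarray()

# scikit-learn: natural term frequency + idf, then each row scaled to unit length
tfidf = TfidfTransformer(norm='l2').fit_transform(counts).toarray()
print(np.linalg.norm(tfidf, axis=1))  # the l2 norm of every document vector is 1

# augmented frequency: 0.5 + 0.5 * tf / max_tf, computed per document,
# which dampens the effect of document length directly on the raw counts
aug_tf = 0.5 + 0.5 * counts / counts.max(axis=1, keepdims=True)
print(aug_tf.max(axis=1))  # the most frequent term in each document gets 1.0
```

So both techniques reduce the advantage of long documents, but cosine normalization rescales whole tf-idf vectors, while augmented frequency rescales the raw term frequencies before any idf weighting.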