From niourf at gmail.com Sat Jun 1 10:00:23 2019 From: niourf at gmail.com (Nicolas Hug) Date: Sat, 1 Jun 2019 10:00:23 -0400 Subject: [scikit-learn] How is linear regression in scikit-learn done? Do you need train and test split? In-Reply-To: References: Message-ID: Splitting the data into train and test data is needed with any machine learning model (not just linear regression with or without least squares). The idea is that you want to evaluate the performance of your model (prediction + scoring) on a portion of the data that you did not use for training. You'll find more details in the user guide https://scikit-learn.org/stable/modules/cross_validation.html Nicolas On 5/31/19 8:54 PM, C W wrote: > Hello everyone, > > I'm new to scikit learn. I see that many tutorial in scikit-learn > follows the work-flow along the lines of > 1) tranform the data > 2) split the data: train, test > 3) instantiate the sklearn object and fit > 4) predict and tune parameter > > But, linear regression is done in least squares, so I don't think > train test split is necessary. So, I guess I can just use the entire > dataset? > > Thanks in advance! > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn -------------- next part -------------- An HTML attachment was scrubbed... URL: From tmrsg11 at gmail.com Sat Jun 1 22:42:14 2019 From: tmrsg11 at gmail.com (C W) Date: Sat, 1 Jun 2019 22:42:14 -0400 Subject: [scikit-learn] How is linear regression in scikit-learn done? Do you need train and test split? In-Reply-To: References: Message-ID: Hi Nicholas, I don't get it. The coefficients are estimated through OLS. Essentially, you are just calculating a matrix pseudo inverse, where beta = (X^T * X)^(-1) * X^T * y Splitting the data does not improve the model, It only works in something like LASSO, where you have a tuning parameter. Holding out some data will make the regression estimates worse off. Hope to hear from you, thanks! On Sat, Jun 1, 2019 at 10:04 AM Nicolas Hug wrote: > Splitting the data into train and test data is needed with any machine > learning model (not just linear regression with or without least squares). > > The idea is that you want to evaluate the performance of your model > (prediction + scoring) on a portion of the data that you did not use for > training. > > You'll find more details in the user guide > https://scikit-learn.org/stable/modules/cross_validation.html > > Nicolas > > > On 5/31/19 8:54 PM, C W wrote: > > Hello everyone, > > I'm new to scikit learn. I see that many tutorial in scikit-learn follows > the work-flow along the lines of > 1) tranform the data > 2) split the data: train, test > 3) instantiate the sklearn object and fit > 4) predict and tune parameter > > But, linear regression is done in least squares, so I don't think train > test split is necessary. So, I guess I can just use the entire dataset? > > Thanks in advance! > > _______________________________________________ > scikit-learn mailing listscikit-learn at python.orghttps://mail.python.org/mailman/listinfo/scikit-learn > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... 
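For concreteness, a minimal sketch of the two points being discussed, on made-up synthetic data (nothing from this thread): the closed-form estimate beta = (X^T X)^(-1) X^T y and LinearRegression agree on the training data, while the held-out split exists only to measure how well that fit generalises, not to improve the coefficients.

    # Sketch on synthetic data: normal equations vs. LinearRegression, plus a held-out score.
    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import r2_score

    rng = np.random.RandomState(0)
    X = rng.randn(200, 3)
    y = X @ np.array([1.5, -2.0, 0.5]) + 0.3 * rng.randn(200)

    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # Closed form with an explicit intercept column: beta = (A^T A)^(-1) A^T y
    A = np.c_[np.ones(len(X_train)), X_train]
    beta = np.linalg.pinv(A.T @ A) @ A.T @ y_train

    lr = LinearRegression().fit(X_train, y_train)
    print(beta[1:], lr.coef_)       # essentially identical slopes
    print(beta[0], lr.intercept_)   # essentially identical intercept

    # The held-out R^2 is the part the closed-form solution says nothing about.
    print(r2_score(y_test, lr.predict(X_test)))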
URL: From joel.nothman at gmail.com Sun Jun 2 01:11:02 2019 From: joel.nothman at gmail.com (Joel Nothman) Date: Sun, 2 Jun 2019 15:11:02 +1000 Subject: [scikit-learn] How is linear regression in scikit-learn done? Do you need train and test split? In-Reply-To: References: Message-ID: You're right that you don't need to use CV for hyperparameter estimation in linear regression, but you may want it for model evaluation. As far as I understand: Holding out a test set is recommended if you aren't entirely sure that the assumptions of the model are held (gaussian error on a linear fit; independent and identically distributed samples). The model evaluation approach in predictive ML, using held-out data, relies only on the weaker assumption that the metric you have chosen, when applied to the test set you have held out, forms a reasonable measure of generalised / real-world performance. (Of course this too is often not held in practice, but it is the primary assumption, in my opinion, that ML practitioners need to be careful of.) On Sun, 2 Jun 2019 at 12:43, C W wrote: > Hi Nicholas, > > I don't get it. > > The coefficients are estimated through OLS. Essentially, you are just > calculating a matrix pseudo inverse, where > beta = (X^T * X)^(-1) * X^T * y > > Splitting the data does not improve the model, It only works in something > like LASSO, where you have a tuning parameter. > > Holding out some data will make the regression estimates worse off. > > Hope to hear from you, thanks! > > > > On Sat, Jun 1, 2019 at 10:04 AM Nicolas Hug wrote: > >> Splitting the data into train and test data is needed with any machine >> learning model (not just linear regression with or without least squares). >> >> The idea is that you want to evaluate the performance of your model >> (prediction + scoring) on a portion of the data that you did not use for >> training. >> >> You'll find more details in the user guide >> https://scikit-learn.org/stable/modules/cross_validation.html >> >> Nicolas >> >> >> On 5/31/19 8:54 PM, C W wrote: >> >> Hello everyone, >> >> I'm new to scikit learn. I see that many tutorial in scikit-learn follows >> the work-flow along the lines of >> 1) tranform the data >> 2) split the data: train, test >> 3) instantiate the sklearn object and fit >> 4) predict and tune parameter >> >> But, linear regression is done in least squares, so I don't think train >> test split is necessary. So, I guess I can just use the entire dataset? >> >> Thanks in advance! >> >> _______________________________________________ >> scikit-learn mailing listscikit-learn at python.orghttps://mail.python.org/mailman/listinfo/scikit-learn >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From jbbrown at kuhp.kyoto-u.ac.jp Mon Jun 3 00:19:30 2019 From: jbbrown at kuhp.kyoto-u.ac.jp (Brown J.B.) Date: Mon, 3 Jun 2019 13:19:30 +0900 Subject: [scikit-learn] How is linear regression in scikit-learn done? Do you need train and test split? 
In-Reply-To: References: Message-ID: > > As far as I understand: Holding out a test set is recommended if you > aren't entirely sure that the assumptions of the model are held (gaussian > error on a linear fit; independent and identically distributed samples). > The model evaluation approach in predictive ML, using held-out data, relies > only on the weaker assumption that the metric you have chosen, when applied > to the test set you have held out, forms a reasonable measure of > generalised / real-world performance. (Of course this too is often not held > in practice, but it is the primary assumption, in my opinion, that ML > practitioners need to be careful of.) > Dear CW, As Joel as said, holding out a test set will help you evaluate the validity of model assumptions, and his last point (reasonable measure of generalised performance) is absolutely essential for understanding the capabilities and limitations of ML. To add to your checklist of interpreting ML papers properly, be cautious when interpreting reports of high performance when using 5/10-fold or Leave-One-Out cross-validation on large datasets, where "large" depends on the nature of the problem setting. Results are also highly dependent on the distributions of the underlying independent variables (e.g., 60000 datapoints all with near-identical distributions may yield phenomenal performance in cross validation and be almost non-predictive in truly unknown/prospective situations). Even at 500 datapoints, if independent variable distributions look similar (with similar endpoints), then when each model is trained on 80% of that data, the remaining 20% will certainly be predictable, and repeating that five times will yield statistics that seem impressive. So, again, while problem context completely dictates ML experiment design, metric selection, and interpretation of outcome, my personal rule of thumb is to do no-more than 2-fold cross-validation (50% train, 50% predict) when having 100+ datapoints. Even more extreme, using try 33% for training and 66% for validation (or even 20/80). If your model still reports good statistics, then you can believe that the patterns in the training data extrapolate well to the ones in the external validation data. Hope this helps, J.B. -------------- next part -------------- An HTML attachment was scrubbed... URL: From nelle.varoquaux at gmail.com Mon Jun 3 08:20:28 2019 From: nelle.varoquaux at gmail.com (Nelle Varoquaux) Date: Mon, 3 Jun 2019 08:20:28 -0400 Subject: [scikit-learn] Only a few days left to submit! -- 2019 John Hunter Excellence in Plotting Contest Message-ID: Hi everybody, There are only a few days left to submit to the 2019 John Hunter Excellence in Plotting Contest! If you're interested in participating, note that you have until June 8th to prepare your submission. In memory of John Hunter, we are pleased to be reviving the SciPy John Hunter Excellence in Plotting Competition for 2019. This open competition aims to highlight the importance of data visualization to scientific progress and showcase the capabilities of open source software. Participants are invited to submit scientific plots to be judged by a panel. The winning entries will be announced and displayed at the conference. 
John Hunter?s family and NumFocus are graciously sponsoring cash prizes for the winners in the following amounts: - 1st prize: $1000 - 2nd prize: $750 - 3rd prize: $500 - Entries must be submitted by June, 8th to the form at https://goo.gl/forms/cFTB3FUBrMPfQ7Vz1 - Winners will be announced at Scipy 2019 in Austin, TX. - Participants do not need to attend the Scipy conference. - Entries may take the definition of ?visualization? rather broadly. Entries may be, for example, a traditional printed plot, an interactive visualization for the web, or an animation. - Source code for the plot must be provided, in the form of Python code and/or a Jupyter notebook, along with a rendering of the plot in a widely used format. This may be, for example, PDF for print, standalone HTML and Javascript for an interactive plot, or MPEG-4 for a video. If the original data can not be shared for reasons of size or licensing, "fake" data may be substituted, along with an image of the plot using real data. - Each entry must include a 300-500 word abstract describing the plot and its importance for a general scientific audience. - Entries will be judged on their clarity, innovation and aesthetics, but most importantly for their effectiveness in communicating a real-world problem. Entrants are encouraged to submit plots that were used during the course of research or work, rather than merely being hypothetical. - SciPy reserves the right to display any and all entries, whether prize-winning or not, at the conference, use in any materials or on its website, with attribution to the original author(s). SciPy John Hunter Excellence in Plotting Competition Co-Chairs Hannah Aizenman Thomas Caswell Madicken Munk Nelle Varoquaux -------------- next part -------------- An HTML attachment was scrubbed... URL: From t3kcit at gmail.com Mon Jun 3 11:41:17 2019 From: t3kcit at gmail.com (Andreas Mueller) Date: Mon, 3 Jun 2019 11:41:17 -0400 Subject: [scikit-learn] How is linear regression in scikit-learn done? Do you need train and test split? In-Reply-To: References: Message-ID: <23948dee-bbaa-df7e-0b13-559a40f7a372@gmail.com> This classical paper on statistical practices (Breiman's "two cultures") might be helpful to understand the different viewpoints: https://projecteuclid.org/euclid.ss/1009213726 On 6/3/19 12:19 AM, Brown J.B. via scikit-learn wrote: > > As far as I understand: Holding out a test set is recommended if > you aren't entirely sure that the assumptions of the model are > held (gaussian error on a linear fit; independent and identically > distributed samples). The model evaluation approach in predictive > ML, using held-out data, relies only on the weaker assumption that > the metric you have chosen, when applied to the test set you have > held out, forms a reasonable measure of generalised / real-world > performance. (Of course this too is often not held in practice, > but it is the primary assumption, in my opinion,?that ML > practitioners need to be careful of.) > > > Dear CW, > As Joel as said, holding out a test set will help you evaluate the > validity of model assumptions, and his last point (reasonable measure > of generalised performance) is absolutely essential for understanding > the capabilities and limitations of ML. > > To add to your checklist of interpreting ML papers properly, be > cautious when interpreting reports of high performance when using > 5/10-fold or Leave-One-Out cross-validation on large datasets, where > "large" depends on the nature of the problem setting. 
> Results are also highly dependent on the distributions of the > underlying independent variables (e.g., 60000 datapoints all with > near-identical distributions may yield phenomenal performance in cross > validation and be almost non-predictive in truly unknown/prospective > situations). > Even at 500 datapoints, if independent variable distributions look > similar (with similar endpoints), then when each model is trained on > 80% of that data, the remaining 20% will certainly be predictable, and > repeating that five times will yield statistics that seem impressive. > > So, again, while problem context completely dictates ML experiment > design, metric selection, and interpretation of outcome, my personal > rule of thumb is to do no-more than 2-fold cross-validation (50% > train, 50% predict) when having 100+ datapoints. > Even more extreme, using try 33% for training and 66% for validation > (or even 20/80). > If your model still reports good statistics, then you can believe that > the patterns in the training data extrapolate well to the ones in the > external validation data. > > Hope this helps, > J.B. > > > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn -------------- next part -------------- An HTML attachment was scrubbed... URL: From tmrsg11 at gmail.com Tue Jun 4 20:44:38 2019 From: tmrsg11 at gmail.com (C W) Date: Tue, 4 Jun 2019 20:44:38 -0400 Subject: [scikit-learn] How is linear regression in scikit-learn done? Do you need train and test split? In-Reply-To: <23948dee-bbaa-df7e-0b13-559a40f7a372@gmail.com> References: <23948dee-bbaa-df7e-0b13-559a40f7a372@gmail.com> Message-ID: Thank you all for the replies. I agree that prediction accuracy is great for evaluating black-box ML models. Especially advanced models like neural networks, or not-so-black models like LASSO, because they are NP-hard to solve. Linear regression is not a black-box. I view prediction accuracy as an overkill on interpretable models. Especially when you can use R-squared, coefficient significance, etc. Prediction accuracy also does not tell you which feature is important. What do you guys think? Thank you! . On Mon, Jun 3, 2019 at 11:43 AM Andreas Mueller wrote: > This classical paper on statistical practices (Breiman's "two cultures") > might be helpful to understand the different viewpoints: > > https://projecteuclid.org/euclid.ss/1009213726 > > > On 6/3/19 12:19 AM, Brown J.B. via scikit-learn wrote: > > As far as I understand: Holding out a test set is recommended if you >> aren't entirely sure that the assumptions of the model are held (gaussian >> error on a linear fit; independent and identically distributed samples). >> The model evaluation approach in predictive ML, using held-out data, relies >> only on the weaker assumption that the metric you have chosen, when applied >> to the test set you have held out, forms a reasonable measure of >> generalised / real-world performance. (Of course this too is often not held >> in practice, but it is the primary assumption, in my opinion, that ML >> practitioners need to be careful of.) >> > > Dear CW, > As Joel as said, holding out a test set will help you evaluate the > validity of model assumptions, and his last point (reasonable measure of > generalised performance) is absolutely essential for understanding the > capabilities and limitations of ML. 
> > To add to your checklist of interpreting ML papers properly, be cautious > when interpreting reports of high performance when using 5/10-fold or > Leave-One-Out cross-validation on large datasets, where "large" depends on > the nature of the problem setting. > Results are also highly dependent on the distributions of the underlying > independent variables (e.g., 60000 datapoints all with near-identical > distributions may yield phenomenal performance in cross validation and be > almost non-predictive in truly unknown/prospective situations). > Even at 500 datapoints, if independent variable distributions look similar > (with similar endpoints), then when each model is trained on 80% of that > data, the remaining 20% will certainly be predictable, and repeating that > five times will yield statistics that seem impressive. > > So, again, while problem context completely dictates ML experiment design, > metric selection, and interpretation of outcome, my personal rule of thumb > is to do no-more than 2-fold cross-validation (50% train, 50% predict) when > having 100+ datapoints. > Even more extreme, using try 33% for training and 66% for validation (or > even 20/80). > If your model still reports good statistics, then you can believe that the > patterns in the training data extrapolate well to the ones in the external > validation data. > > Hope this helps, > J.B. > > > > > _______________________________________________ > scikit-learn mailing listscikit-learn at python.orghttps://mail.python.org/mailman/listinfo/scikit-learn > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From jbbrown at kuhp.kyoto-u.ac.jp Tue Jun 4 21:43:09 2019 From: jbbrown at kuhp.kyoto-u.ac.jp (Brown J.B.) Date: Wed, 5 Jun 2019 10:43:09 +0900 Subject: [scikit-learn] How is linear regression in scikit-learn done? Do you need train and test split? In-Reply-To: References: <23948dee-bbaa-df7e-0b13-559a40f7a372@gmail.com> Message-ID: Dear CW, > Linear regression is not a black-box. I view prediction accuracy as an > overkill on interpretable models. Especially when you can use R-squared, > coefficient significance, etc. > Following on my previous note about being cautious with cross-validated evaluation for classification, the same applies for regression. About 20 years ago, chemoinformatics researchers pointed out the caution needed with using CV-based R^2 (q^2) as a measure of performance. "Beware of q2!" Golbraikh and Tropsha, J Mol Graph Modeling (2002) 20:269 https://www.sciencedirect.com/science/article/pii/S1093326301001231 In this article, they propose to measure correlation by using both known-VS-predicted _and_ predicted-VS-known calculations of the correlation coefficient, and importantly, that the regression line to fit in both cases goes through the origin. The resulting coefficients are checked as a pair, and the authors argue that only if they are both high can one say that the model is fitting the data well. Contrast this to Pearson Product Moment Correlation (R), where the fit of the line has no requirement to go through the origin of the fit. I found the paper above to be helpful in filtering for more robust regression models, and have implemented my own version of their method, which I use as my first evaluation metric when performing regression modelling. Hope this provides you some thought. 
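To make that concrete, here is a rough sketch of the through-origin check (my own reading of the idea, not the authors' exact recipe; the acceptance thresholds themselves are given in the paper):

    import numpy as np

    def through_origin_check(y_obs, y_pred):
        y_obs = np.asarray(y_obs, dtype=float)
        y_pred = np.asarray(y_pred, dtype=float)
        r2 = np.corrcoef(y_obs, y_pred)[0, 1] ** 2                # ordinary squared correlation
        k = np.sum(y_obs * y_pred) / np.sum(y_pred ** 2)          # slope of obs ~ pred, no intercept
        k_prime = np.sum(y_obs * y_pred) / np.sum(y_obs ** 2)     # slope of pred ~ obs, no intercept
        # coefficient of determination of each through-origin fit
        r0_sq = 1 - np.sum((y_obs - k * y_pred) ** 2) / np.sum((y_obs - y_obs.mean()) ** 2)
        r0_sq_prime = 1 - np.sum((y_pred - k_prime * y_obs) ** 2) / np.sum((y_pred - y_pred.mean()) ** 2)
        return r2, (k, r0_sq), (k_prime, r0_sq_prime)

A model that looks fine by the ordinary squared correlation, but whose through-origin slopes drift far from 1 or whose through-origin fits collapse, is exactly the situation the paper warns about.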
Prediction accuracy also does not tell you which feature is important. > The contributions of the scikit-learn community have yielded a great set of tools for performing feature weighting separate from model performance evaluation. All you need to do is read the documentation and try out some of the examples, and you should be ready to adapt to your situation. J.B. -------------- next part -------------- An HTML attachment was scrubbed... URL: From matthieu.brucher at gmail.com Wed Jun 5 02:43:28 2019 From: matthieu.brucher at gmail.com (Matthieu Brucher) Date: Wed, 5 Jun 2019 07:43:28 +0100 Subject: [scikit-learn] How is linear regression in scikit-learn done? Do you need train and test split? In-Reply-To: References: <23948dee-bbaa-df7e-0b13-559a40f7a372@gmail.com> Message-ID: Hi CW, It's not about the concept of the black box, none of the algorithms in sklearn are a blackbox. The question is about model validity. Is linear regression a valid representation of your data? That's what the train/test answers. You may think so, but only this process will answer it properly. Matthieu Le mer. 5 juin 2019 ? 01:46, C W a ?crit : > Thank you all for the replies. > > I agree that prediction accuracy is great for evaluating black-box ML > models. Especially advanced models like neural networks, or not-so-black > models like LASSO, because they are NP-hard to solve. > > Linear regression is not a black-box. I view prediction accuracy as an > overkill on interpretable models. Especially when you can use R-squared, > coefficient significance, etc. > > Prediction accuracy also does not tell you which feature is important. > > What do you guys think? Thank you! > > . > > On Mon, Jun 3, 2019 at 11:43 AM Andreas Mueller wrote: > >> This classical paper on statistical practices (Breiman's "two cultures") >> might be helpful to understand the different viewpoints: >> >> https://projecteuclid.org/euclid.ss/1009213726 >> >> >> On 6/3/19 12:19 AM, Brown J.B. via scikit-learn wrote: >> >> As far as I understand: Holding out a test set is recommended if you >>> aren't entirely sure that the assumptions of the model are held (gaussian >>> error on a linear fit; independent and identically distributed samples). >>> The model evaluation approach in predictive ML, using held-out data, relies >>> only on the weaker assumption that the metric you have chosen, when applied >>> to the test set you have held out, forms a reasonable measure of >>> generalised / real-world performance. (Of course this too is often not held >>> in practice, but it is the primary assumption, in my opinion, that ML >>> practitioners need to be careful of.) >>> >> >> Dear CW, >> As Joel as said, holding out a test set will help you evaluate the >> validity of model assumptions, and his last point (reasonable measure of >> generalised performance) is absolutely essential for understanding the >> capabilities and limitations of ML. >> >> To add to your checklist of interpreting ML papers properly, be cautious >> when interpreting reports of high performance when using 5/10-fold or >> Leave-One-Out cross-validation on large datasets, where "large" depends on >> the nature of the problem setting. >> Results are also highly dependent on the distributions of the underlying >> independent variables (e.g., 60000 datapoints all with near-identical >> distributions may yield phenomenal performance in cross validation and be >> almost non-predictive in truly unknown/prospective situations). 
>> Even at 500 datapoints, if independent variable distributions look >> similar (with similar endpoints), then when each model is trained on 80% of >> that data, the remaining 20% will certainly be predictable, and repeating >> that five times will yield statistics that seem impressive. >> >> So, again, while problem context completely dictates ML experiment >> design, metric selection, and interpretation of outcome, my personal rule >> of thumb is to do no-more than 2-fold cross-validation (50% train, 50% >> predict) when having 100+ datapoints. >> Even more extreme, using try 33% for training and 66% for validation (or >> even 20/80). >> If your model still reports good statistics, then you can believe that >> the patterns in the training data extrapolate well to the ones in the >> external validation data. >> >> Hope this helps, >> J.B. >> >> >> >> >> _______________________________________________ >> scikit-learn mailing listscikit-learn at python.orghttps://mail.python.org/mailman/listinfo/scikit-learn >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -- Quantitative researcher, Ph.D. Blog: http://blog.audio-tk.com/ LinkedIn: http://www.linkedin.com/in/matthieubrucher -------------- next part -------------- An HTML attachment was scrubbed... URL: From jbbrown at kuhp.kyoto-u.ac.jp Wed Jun 5 03:17:16 2019 From: jbbrown at kuhp.kyoto-u.ac.jp (Brown J.B.) Date: Wed, 5 Jun 2019 16:17:16 +0900 Subject: [scikit-learn] How is linear regression in scikit-learn done? Do you need train and test split? In-Reply-To: References: <23948dee-bbaa-df7e-0b13-559a40f7a372@gmail.com> Message-ID: 2019?6?5?(?) 10:43 Brown J.B. : > Contrast this to Pearson Product Moment Correlation (R), where the fit of > the line has no requirement to go through the origin of the fit. > Not sure what I was thinking when I wrote that. Pardon the mistake; I'm fully aware that Pearson R is merely a coefficient merely indicating direction of trend. -------------- next part -------------- An HTML attachment was scrubbed... URL: From matthew.brett at gmail.com Wed Jun 5 05:45:17 2019 From: matthew.brett at gmail.com (Matthew Brett) Date: Wed, 5 Jun 2019 10:45:17 +0100 Subject: [scikit-learn] How is linear regression in scikit-learn done? Do you need train and test split? In-Reply-To: References: <23948dee-bbaa-df7e-0b13-559a40f7a372@gmail.com> Message-ID: On Wed, Jun 5, 2019 at 8:18 AM Brown J.B. via scikit-learn wrote: > > 2019?6?5?(?) 10:43 Brown J.B. : >> >> Contrast this to Pearson Product Moment Correlation (R), where the fit of the line has no requirement to go through the origin of the fit. > > > Not sure what I was thinking when I wrote that. > Pardon the mistake; I'm fully aware that Pearson R is merely a coefficient merely indicating direction of trend. Ah - now I'm more confused. r is surely a coefficient, but I personally find it most useful to think of r as the least-squares regression slope once the x and y values have been transformed to standard scores. For that case, the least-squares intercept must be 0. 
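A quick numerical check of that equivalence, with made-up data:

    import numpy as np
    from scipy.stats import pearsonr

    rng = np.random.RandomState(0)
    x = rng.randn(100)
    y = 2 * x + rng.randn(100)

    zx = (x - x.mean()) / x.std()
    zy = (y - y.mean()) / y.std()

    slope, intercept = np.polyfit(zx, zy, 1)   # least-squares fit on the standardized values
    print(slope, intercept)                    # slope equals Pearson r, intercept ~ 0
    print(pearsonr(x, y)[0])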
Cheers, Matthew From pahome.chen at mirlab.org Wed Jun 5 06:56:35 2019 From: pahome.chen at mirlab.org (lampahome) Date: Wed, 5 Jun 2019 18:56:35 +0800 Subject: [scikit-learn] Any way to tune threshold of Birch rather than GridSearchCV? Message-ID: I use Birch to cluster my data and my data is kind of time-series data. I don't know the actually cluster numbers and need to read large data(online learning), so I choose Birch rather than MiniKmeans. When I read it, I found the critical parameters might be branching_factor and threshold, and threshold will affect my cluster numbers obviously! Any way to estimate the suitable threshold of Birch? Any paper suggestion is ok. thx -------------- next part -------------- An HTML attachment was scrubbed... URL: From t3kcit at gmail.com Wed Jun 5 09:09:08 2019 From: t3kcit at gmail.com (Andreas Mueller) Date: Wed, 5 Jun 2019 09:09:08 -0400 Subject: [scikit-learn] How is linear regression in scikit-learn done? Do you need train and test split? In-Reply-To: References: <23948dee-bbaa-df7e-0b13-559a40f7a372@gmail.com> Message-ID: On 6/4/19 8:44 PM, C W wrote: > Thank you all for the replies. > > I agree that prediction accuracy is great for evaluating black-box ML > models. Especially advanced models like neural networks, or > not-so-black models like LASSO, because they are NP-hard to solve. > > Linear regression is not a black-box. I view prediction accuracy as an > overkill on interpretable models. Especially when you can use > R-squared, coefficient significance, etc. > > Prediction accuracy also does not tell you which feature is important. > > What do you guys think? Thank you! > Did you read the paper that I sent? ;) From pahome.chen at mirlab.org Thu Jun 6 03:05:28 2019 From: pahome.chen at mirlab.org (lampahome) Date: Thu, 6 Jun 2019 15:05:28 +0800 Subject: [scikit-learn] fit before partial_fit ? Message-ID: I tried MiniBatchKMeans with two order: fit -> partial_fit partial_fit -> partial_fit The clustering results are different what's their difference? -------------- next part -------------- An HTML attachment was scrubbed... URL: From ahmetcik at fhi-berlin.mpg.de Thu Jun 6 08:56:59 2019 From: ahmetcik at fhi-berlin.mpg.de (ahmetcik) Date: Thu, 06 Jun 2019 14:56:59 +0200 Subject: [scikit-learn] Normalization in ridge regression when there is no intercept Message-ID: <1704d7dd5f25fd34fa23931406b7b846@fhi-berlin.mpg.de> Hello everyone, I have just recognized that when using ridge regression without an intercept no normalization is performed even if the argument "normalize" is set to True. Though it is, of course, no problem to manually normalize the input matrix X I have become curious if there was a special reason to not normalize the data, e.g. the columns of X scaled (but not centered to have mean zero) to have unit norm such that their lengths do not affect the outcome. Thanks in advance! Emre From vaggi.federico at gmail.com Thu Jun 6 13:06:39 2019 From: vaggi.federico at gmail.com (federico vaggi) Date: Thu, 6 Jun 2019 10:06:39 -0700 Subject: [scikit-learn] fit before partial_fit ? In-Reply-To: References: Message-ID: k-means isn't a convex problem, unless you freeze the initialization, you are going to get very different solutions (depending on the dataset) with different initializations. On Thu, Jun 6, 2019 at 12:05 AM lampahome wrote: > I tried MiniBatchKMeans with two order: > fit -> partial_fit > partial_fit -> partial_fit > > The clustering results are different > > what's their difference? 
> _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From rth.yurchak at pm.me Fri Jun 7 04:13:46 2019 From: rth.yurchak at pm.me (Roman Yurchak) Date: Fri, 07 Jun 2019 08:13:46 +0000 Subject: [scikit-learn] Normalization in ridge regression when there is no intercept In-Reply-To: <1704d7dd5f25fd34fa23931406b7b846@fhi-berlin.mpg.de> References: <1704d7dd5f25fd34fa23931406b7b846@fhi-berlin.mpg.de> Message-ID: On 06/06/2019 14:56, ahmetcik wrote: > I have just recognized that when using ridge regression without an > intercept no normalization is performed even if the argument "normalize" > is set to True. It's a known longstanding issue https://github.com/scikit-learn/scikit-learn/issues/3020 It would be indeed good to find a solution. -- Roman From adrin.jalali at gmail.com Fri Jun 7 10:50:59 2019 From: adrin.jalali at gmail.com (Adrin) Date: Fri, 7 Jun 2019 18:50:59 +0400 Subject: [scikit-learn] Google code reviews In-Reply-To: References: Message-ID: Would we need to nominate PRs for them to review, or would they find them on their own? Either case, could use a hand and extra eyes, why not On Sat., May 25, 2019, 16:10 Joel Nothman, wrote: > For some of the larger PRs, this might be helpful. Not going to help where > the intricacies of Scikit-learn API come in play. > > On Sat, 25 May 2019 at 04:17, Andreas Mueller wrote: > >> Hi All. >> What do you think of https://www.pullrequest.com/googleserve/? >> It's sponsored code reviews. Could be interesting, right? >> >> Best, >> Andy >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From t3kcit at gmail.com Fri Jun 7 11:21:10 2019 From: t3kcit at gmail.com (Andreas Mueller) Date: Fri, 7 Jun 2019 11:21:10 -0400 Subject: [scikit-learn] Google code reviews In-Reply-To: References: Message-ID: <284ec1fe-2aec-e6b6-3736-49934f06a126@gmail.com> I think they might actually review the existing code base? But I'm not entirely sure. We can also nominate PRs, I think. On 6/7/19 10:50 AM, Adrin wrote: > Would we need to nominate PRs for them to review, or would they find > them on their own? Either case, could use a hand and extra eyes, why not > > On Sat., May 25, 2019, 16:10 Joel Nothman, > wrote: > > For some of the larger PRs, this might be helpful. Not going to > help where the intricacies of Scikit-learn API come in play. > > On Sat, 25 May 2019 at 04:17, Andreas Mueller > wrote: > > Hi All. > What do you think of https://www.pullrequest.com/googleserve/? > It's sponsored code reviews. Could be interesting, right? 
> > Best, > Andy > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn -------------- next part -------------- An HTML attachment was scrubbed... URL: From ericjvandervelden at gmail.com Sat Jun 8 05:34:25 2019 From: ericjvandervelden at gmail.com (Eric J. Van der Velden) Date: Sat, 8 Jun 2019 11:34:25 +0200 Subject: [scikit-learn] LogisticRegression Message-ID: Hello, I am learning sklearn from my book of Geron. On page 137 he learns the model of petal widths. When I implements logistic regression myself as I learned from my Coursera course or from my book of Bishop I find that the following parameters are found where the cost function is minimal: In [6219]: w Out[6219]: array([[-21.12563996], [ 12.94750716]]) I used Gradient Descent and Newton-Raphson, both give the same answer. My question is: how can I see after fit() which parameters LogisticRegression() has found? One other question also: when I read the documentation page, https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression, I see a different cost function as I read in the books. Thanks. -------------- next part -------------- An HTML attachment was scrubbed... URL: From ericjvandervelden at gmail.com Sat Jun 8 13:56:39 2019 From: ericjvandervelden at gmail.com (Eric J. Van der Velden) Date: Sat, 8 Jun 2019 19:56:39 +0200 Subject: [scikit-learn] LogisticRegression In-Reply-To: References: Message-ID: Here I have added what I had programmed. With sklearn's LogisticRegression(), how can I see the parameters it has found after .fit() where the cost is minimal? I use the book of Geron about scikit-learn and tensorflow and on page 137 he trains the model of petal widths. I did the following: iris=datasets.load_iris() a1=iris['data'][:,3:] y=(iris['target']==2).astype(int) log_reg=LogisticRegression() log_reg.fit(a1,y) log_reg.coef_ array([[2.61727777]]) log_reg.intercept_ array([-4.2209364]) I did the logistic regression myself with Gradient Descent or Newton-Raphson as I learned from my Coursera course and respectively from my book of Bishop. I used the Gradient Descent method like so: from sklearn import datasets iris=datasets.load_iris() a1=iris['data'][:,3:] A1=np.c_[np.ones((150,1)),a1] y=(iris['target']==2).astype(int).reshape(-1,1) lmda=1 from scipy.special import expit def logreg_gd(w): z2=A1.dot(w) a2=expit(z2) delta2=a2-y w=w-(lmda/len(a1))*A1.T.dot(delta2) return w w=np.array([[0],[0]]) for i in range(0,100000): w=logreg_gd(w) In [6219]: w Out[6219]: array([[-21.12563996], [ 12.94750716]]) I used Newton-Raphson like so, see Bishop page 207, from sklearn import datasets iris=datasets.load_iris() a1=iris['data'][:,3:] A1=np.c_[np.ones(len(a1)),a1] y=(iris['target']==2).astype(int).reshape(-1,1) def logreg_nr(w): z1=A1.dot(w) y=expit(z1) R=np.diag((y*(1-y))[:,0]) H=A1.T.dot(R).dot(A1) tmp=A1.dot(w)-np.linalg.inv(R).dot(y-t) v=np.linalg.inv(H).dot(A1.T).dot(R).dot(tmp) return v w=np.array([[0],[0]]) for i in range(0,10): w=logreg_nr(w) In [5149]: w Out[5149]: array([[-21.12563996], [ 12.94750716]]) Notice how much faster Newton-Raphson goes than Gradient Descent. 
But they give the same result. How can I see which parameters LogisticRegression() found? And should I give LogisticRegression other parameters? On Sat, Jun 8, 2019 at 11:34 AM Eric J. Van der Velden < ericjvandervelden at gmail.com> wrote: > Hello, > > I am learning sklearn from my book of Geron. On page 137 he learns the > model of petal widths. > > When I implements logistic regression myself as I learned from my Coursera > course or from my book of Bishop I find that the following parameters are > found where the cost function is minimal: > > In [6219]: w > Out[6219]: > array([[-21.12563996], > [ 12.94750716]]) > > I used Gradient Descent and Newton-Raphson, both give the same answer. > > My question is: how can I see after fit() which parameters > LogisticRegression() has found? > > One other question also: when I read the documentation page, > https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression, > I see a different cost function as I read in the books. > > Thanks. > > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From pahome.chen at mirlab.org Sun Jun 9 21:18:28 2019 From: pahome.chen at mirlab.org (lampahome) Date: Mon, 10 Jun 2019 09:18:28 +0800 Subject: [scikit-learn] Tune parameters when I need to load data segment by segment? Message-ID: As title I have one huge data to load, so I need to train it incrementally. So I load data segment by segment and train segment by segment like: MiniBatchKMeans. In that condition, how to tune parameters? tune the first part of data or every part of data? -------------- next part -------------- An HTML attachment was scrubbed... URL: From pahome.chen at mirlab.org Sun Jun 9 22:10:53 2019 From: pahome.chen at mirlab.org (lampahome) Date: Mon, 10 Jun 2019 10:10:53 +0800 Subject: [scikit-learn] fit before partial_fit ? In-Reply-To: References: Message-ID: federico vaggi ? 2019?6?7? ?? ??1:08??? > k-means isn't a convex problem, unless you freeze the initialization, you > are going to get very different solutions (depending on the dataset) with > different initializations. > > Nope, I specify the random_state=0. u can try it. >>> x = np.array([[1,2],[2,3]]) >>> y = np.array([[3,4],[4,5],[5,6]]) >>> z = np.append(x,y, axis=0) >>> from sklearn.cluster import MiniBatchKMeans as MBK >>> m = MBK(random_state=0, n_clusters=2) >>> m.fit(x) ; m.labels_ array([1,0], dtype=int32) <-- (1-a) >>> m.partial_fit(y) ; m.labels_ array([0,0,0], dtype=int32) <-- (1-b) >>> m = MBK(random_state=0, n_clusters=2) >>> m.partial_fit(x) ; m.labels_ array([0,1], dtype=int32) <-- (2-a) >>> m.partial_fit(y) ; m.labels_ array([1,1,1], dtype=int32) <-- (2-b) 1-a,1-b and 2-a, 2-b are all different, especially the members of each cluster. I'm just confused about what usage of partial_fit and fit is the suitable(reasonable?) way to cluster incrementally? thx -------------- next part -------------- An HTML attachment was scrubbed... URL: From christian.braune79 at gmail.com Mon Jun 10 00:25:24 2019 From: christian.braune79 at gmail.com (Christian Braune) Date: Mon, 10 Jun 2019 06:25:24 +0200 Subject: [scikit-learn] fit before partial_fit ? In-Reply-To: References: Message-ID: The clusters produces by your examples are actually the same (despite the different labels). I'd guess that "fit" and "partial_fit" draw a different amount of random_numbers before actually assigning a label to the first (randomly drawn) sample from "x" (in your code). This is why the labeling is permutated. Best regards Christian Am Mo., 10. 
Juni 2019 um 04:12 Uhr schrieb lampahome : > > > federico vaggi ? 2019?6?7? ?? ??1:08??? > >> k-means isn't a convex problem, unless you freeze the initialization, you >> are going to get very different solutions (depending on the dataset) with >> different initializations. >> >> > Nope, I specify the random_state=0. u can try it. > > >>> x = np.array([[1,2],[2,3]]) > >>> y = np.array([[3,4],[4,5],[5,6]]) > >>> z = np.append(x,y, axis=0) > >>> from sklearn.cluster import MiniBatchKMeans as MBK > >>> m = MBK(random_state=0, n_clusters=2) > >>> m.fit(x) ; m.labels_ > array([1,0], dtype=int32) <-- (1-a) > >>> m.partial_fit(y) ; m.labels_ > array([0,0,0], dtype=int32) <-- (1-b) > > >>> m = MBK(random_state=0, n_clusters=2) > >>> m.partial_fit(x) ; m.labels_ > array([0,1], dtype=int32) <-- (2-a) > >>> m.partial_fit(y) ; m.labels_ > array([1,1,1], dtype=int32) <-- (2-b) > > 1-a,1-b and 2-a, 2-b are all different, especially the members of each > cluster. > I'm just confused about what usage of partial_fit and fit is the > suitable(reasonable?) way to cluster incrementally? > > thx > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From alexandre.gramfort at inria.fr Mon Jun 10 03:16:17 2019 From: alexandre.gramfort at inria.fr (Alexandre Gramfort) Date: Mon, 10 Jun 2019 09:16:17 +0200 Subject: [scikit-learn] Difference in normalization between Lasso and LogisticRegression + L1 In-Reply-To: References: Message-ID: see https://github.com/scikit-learn/scikit-learn/pulls?utf8=%E2%9C%93&q=is%3Apr+is%3Aclosed+scale_C+ for historical perspective on this issue. Alex On Wed, May 29, 2019 at 11:32 PM Stuart Reynolds wrote: > > I looked into like a while ago. There were differences in which algorithms regularized the intercept, and which ones do not. (I believe liblinear does, lbgfs does not). > All of the algorithms disagreed with logistic regression in scipy. > > - Stuart > > On Wed, May 29, 2019 at 10:50 AM Andreas Mueller wrote: >> >> That is not very ideal indeed. >> I think we just went with what liblinear did, and when saga was introduced kept that behavior. >> It should probably be scaled as in Lasso, I would imagine? >> >> >> On 5/29/19 1:42 PM, Michael Eickenberg wrote: >> >> Hi Jesse, >> >> I think there was an effort to compare normalization methods on the data attachment term between Lasso and Ridge regression back in 2012/13, but this might have not been finished or extended to Logistic Regression. >> >> If it is not documented well, it could definitely benefit from a documentation update. >> >> As for changing it to a more consistent state, that would require adding a keyword argument pertaining to this functionality and, after discussion, possibly changing the default value after some deprecation cycles (though this seems like a dangerous one to change at all imho). >> >> Michael >> >> >> On Wed, May 29, 2019 at 10:38 AM Jesse Livezey wrote: >>> >>> Hi everyone, >>> >>> I noticed recently that in the Lasso implementation (and docs), the MSE term is normalized by the number of samples >>> https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lasso.html >>> >>> but for LogisticRegression + L1, the logloss does not seem to be normalized by the number of samples. One consequence is that the strength of the regularization depends on the number of samples explicitly. 
For instance, in Lasso, if you tile a dataset N times, you will learn the same coef, but in LogisticRegression, you will learn a different coef. >>> >>> Is this the intended behavior of LogisticRegression? I was surprised by this. Either way, it would be helpful to document this more clearly in the Logistic Regression docs (I can make a PR.) >>> https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html >>> >>> Jesse >>> _______________________________________________ >>> scikit-learn mailing list >>> scikit-learn at python.org >>> https://mail.python.org/mailman/listinfo/scikit-learn >> >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn From pahome.chen at mirlab.org Mon Jun 10 06:58:01 2019 From: pahome.chen at mirlab.org (lampahome) Date: Mon, 10 Jun 2019 18:58:01 +0800 Subject: [scikit-learn] How to tune parameters when using partial_fit Message-ID: as title, I try to cluster a huge data, but I don't know how to tune parameters when clustering. If it's a small dataset, I can use gridsearchcv, but how to if it's a huge data? thx -------------- next part -------------- An HTML attachment was scrubbed... URL: From t3kcit at gmail.com Mon Jun 10 13:23:21 2019 From: t3kcit at gmail.com (Andreas Mueller) Date: Mon, 10 Jun 2019 13:23:21 -0400 Subject: [scikit-learn] How to tune parameters when using partial_fit In-Reply-To: References: Message-ID: <4b70d51d-29dd-fcc3-822c-6d6074344935@gmail.com> There's no built-in way to do that with scikit-learn right now, sorry. On 6/10/19 6:58 AM, lampahome wrote: > as title, > > I try to cluster a huge data, but I don't know how to tune parameters > when clustering. > > If it's a small dataset, I can use gridsearchcv, but how to if it's a > huge data? > > thx > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn -------------- next part -------------- An HTML attachment was scrubbed... URL: From ahowe42 at gmail.com Tue Jun 11 04:07:54 2019 From: ahowe42 at gmail.com (Andrew Howe) Date: Tue, 11 Jun 2019 09:07:54 +0100 Subject: [scikit-learn] LogisticRegression In-Reply-To: References: Message-ID: The coef_ attribute of the LogisticRegression object stores the parameters. Andrew <~~~~~~~~~~~~~~~~~~~~~~~~~~~> J. Andrew Howe, PhD LinkedIn Profile ResearchGate Profile Open Researcher and Contributor ID (ORCID) Github Profile Personal Website I live to learn, so I can learn to live. - me <~~~~~~~~~~~~~~~~~~~~~~~~~~~> On Sat, Jun 8, 2019 at 6:58 PM Eric J. Van der Velden < ericjvandervelden at gmail.com> wrote: > Here I have added what I had programmed. > > With sklearn's LogisticRegression(), how can I see the parameters it has > found after .fit() where the cost is minimal? I use the book of Geron about > scikit-learn and tensorflow and on page 137 he trains the model of petal > widths. 
I did the following: > > iris=datasets.load_iris() > a1=iris['data'][:,3:] > y=(iris['target']==2).astype(int) > log_reg=LogisticRegression() > log_reg.fit(a1,y) > > log_reg.coef_ > array([[2.61727777]]) > log_reg.intercept_ > array([-4.2209364]) > > > I did the logistic regression myself with Gradient Descent or > Newton-Raphson as I learned from my Coursera course and respectively from > my book of Bishop. I used the Gradient Descent method like so: > > from sklearn import datasets > iris=datasets.load_iris() > a1=iris['data'][:,3:] > A1=np.c_[np.ones((150,1)),a1] > y=(iris['target']==2).astype(int).reshape(-1,1) > lmda=1 > > from scipy.special import expit > > def logreg_gd(w): > z2=A1.dot(w) > a2=expit(z2) > delta2=a2-y > w=w-(lmda/len(a1))*A1.T.dot(delta2) > return w > > w=np.array([[0],[0]]) > for i in range(0,100000): > w=logreg_gd(w) > > In [6219]: w > Out[6219]: > array([[-21.12563996], > [ 12.94750716]]) > > I used Newton-Raphson like so, see Bishop page 207, > > from sklearn import datasets > iris=datasets.load_iris() > a1=iris['data'][:,3:] > A1=np.c_[np.ones(len(a1)),a1] > y=(iris['target']==2).astype(int).reshape(-1,1) > > def logreg_nr(w): > z1=A1.dot(w) > y=expit(z1) > R=np.diag((y*(1-y))[:,0]) > H=A1.T.dot(R).dot(A1) > tmp=A1.dot(w)-np.linalg.inv(R).dot(y-t) > v=np.linalg.inv(H).dot(A1.T).dot(R).dot(tmp) > return v > > w=np.array([[0],[0]]) > for i in range(0,10): > w=logreg_nr(w) > > In [5149]: w > Out[5149]: > array([[-21.12563996], > [ 12.94750716]]) > > Notice how much faster Newton-Raphson goes than Gradient Descent. But they > give the same result. > > How can I see which parameters LogisticRegression() found? And should I > give LogisticRegression other parameters? > > On Sat, Jun 8, 2019 at 11:34 AM Eric J. Van der Velden < > ericjvandervelden at gmail.com> wrote: > >> Hello, >> >> I am learning sklearn from my book of Geron. On page 137 he learns the >> model of petal widths. >> >> When I implements logistic regression myself as I learned from my >> Coursera course or from my book of Bishop I find that the following >> parameters are found where the cost function is minimal: >> >> In [6219]: w >> Out[6219]: >> array([[-21.12563996], >> [ 12.94750716]]) >> >> I used Gradient Descent and Newton-Raphson, both give the same answer. >> >> My question is: how can I see after fit() which parameters >> LogisticRegression() has found? >> >> One other question also: when I read the documentation page, >> https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression, >> I see a different cost function as I read in the books. >> >> Thanks. >> >> >> >> _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From pahome.chen at mirlab.org Tue Jun 11 04:38:07 2019 From: pahome.chen at mirlab.org (lampahome) Date: Tue, 11 Jun 2019 16:38:07 +0800 Subject: [scikit-learn] How to tune parameters when using partial_fit In-Reply-To: <4b70d51d-29dd-fcc3-822c-6d6074344935@gmail.com> References: <4b70d51d-29dd-fcc3-822c-6d6074344935@gmail.com> Message-ID: I know there's no built-in way to tune parameter batch by batch. I'm curious about is there any suitable/general way to tune parameters batch by batch? Because the distribution is not easy to know when the dataset is too large to load into memory. -------------- next part -------------- An HTML attachment was scrubbed... 
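For reference, a short sketch of the point that resolves this thread further down: LogisticRegression applies an L2 penalty by default (C=1.0), which is why its coef_ and intercept_ differ from the hand-rolled gradient-descent / Newton-Raphson fit. Weakening the penalty with a very large C (or penalty='none' in recent scikit-learn releases) gives values close to the unpenalised solution; exact numbers depend on the solver and version.

    from sklearn import datasets
    from sklearn.linear_model import LogisticRegression

    iris = datasets.load_iris()
    X = iris['data'][:, 3:]                      # petal width only, as in Geron's example
    y = (iris['target'] == 2).astype(int)

    default_fit = LogisticRegression(solver='lbfgs').fit(X, y)
    weak_penalty = LogisticRegression(solver='lbfgs', C=1e10, max_iter=10000).fit(X, y)

    print(default_fit.intercept_, default_fit.coef_)    # the regularised fit
    print(weak_penalty.intercept_, weak_penalty.coef_)  # much larger in magnitude, close to the unpenalised estimate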
URL: From jbbrown at kuhp.kyoto-u.ac.jp Tue Jun 11 05:34:07 2019 From: jbbrown at kuhp.kyoto-u.ac.jp (Brown J.B.) Date: Tue, 11 Jun 2019 18:34:07 +0900 Subject: [scikit-learn] How to tune parameters when using partial_fit In-Reply-To: References: <4b70d51d-29dd-fcc3-822c-6d6074344935@gmail.com> Message-ID: > > I'm curious about is there any suitable/general way to tune parameters > batch by batch? > Because the distribution is not easy to know when the dataset is too large > to load into memory. > Repeated subsampling to estimate a distribution is one alternative. Not guaranteed to match the global distribution, but you should get a reasonable estimate with enough repetitions. -------------- next part -------------- An HTML attachment was scrubbed... URL: From ericjvandervelden at gmail.com Tue Jun 11 11:47:09 2019 From: ericjvandervelden at gmail.com (Eric J. Van der Velden) Date: Tue, 11 Jun 2019 17:47:09 +0200 Subject: [scikit-learn] LogisticRegression In-Reply-To: References: Message-ID: Hi Nicolas, Andrew, Thanks! I found out that it is the regularization term. Sklearn always has that term. When I program logistic regression with that term too, with \lambda=1, I get exactly the same answer as sklearn, when I look at the parameters you gave me. Question is why sklearn always has that term in logistic regression. If you have enough data, do you need a regularization term? Op di 11 jun. 2019 10:08 schreef Andrew Howe : > The coef_ attribute of the LogisticRegression object stores the parameters. > > Andrew > > <~~~~~~~~~~~~~~~~~~~~~~~~~~~> > J. Andrew Howe, PhD > LinkedIn Profile > ResearchGate Profile > Open Researcher and Contributor ID (ORCID) > > Github Profile > Personal Website > I live to learn, so I can learn to live. - me > <~~~~~~~~~~~~~~~~~~~~~~~~~~~> > > > On Sat, Jun 8, 2019 at 6:58 PM Eric J. Van der Velden < > ericjvandervelden at gmail.com> wrote: > >> Here I have added what I had programmed. >> >> With sklearn's LogisticRegression(), how can I see the parameters it has >> found after .fit() where the cost is minimal? I use the book of Geron about >> scikit-learn and tensorflow and on page 137 he trains the model of petal >> widths. I did the following: >> >> iris=datasets.load_iris() >> a1=iris['data'][:,3:] >> y=(iris['target']==2).astype(int) >> log_reg=LogisticRegression() >> log_reg.fit(a1,y) >> >> log_reg.coef_ >> array([[2.61727777]]) >> log_reg.intercept_ >> array([-4.2209364]) >> >> >> I did the logistic regression myself with Gradient Descent or >> Newton-Raphson as I learned from my Coursera course and respectively from >> my book of Bishop. 
I used the Gradient Descent method like so: >> >> from sklearn import datasets >> iris=datasets.load_iris() >> a1=iris['data'][:,3:] >> A1=np.c_[np.ones((150,1)),a1] >> y=(iris['target']==2).astype(int).reshape(-1,1) >> lmda=1 >> >> from scipy.special import expit >> >> def logreg_gd(w): >> z2=A1.dot(w) >> a2=expit(z2) >> delta2=a2-y >> w=w-(lmda/len(a1))*A1.T.dot(delta2) >> return w >> >> w=np.array([[0],[0]]) >> for i in range(0,100000): >> w=logreg_gd(w) >> >> In [6219]: w >> Out[6219]: >> array([[-21.12563996], >> [ 12.94750716]]) >> >> I used Newton-Raphson like so, see Bishop page 207, >> >> from sklearn import datasets >> iris=datasets.load_iris() >> a1=iris['data'][:,3:] >> A1=np.c_[np.ones(len(a1)),a1] >> y=(iris['target']==2).astype(int).reshape(-1,1) >> >> def logreg_nr(w): >> z1=A1.dot(w) >> y=expit(z1) >> R=np.diag((y*(1-y))[:,0]) >> H=A1.T.dot(R).dot(A1) >> tmp=A1.dot(w)-np.linalg.inv(R).dot(y-t) >> v=np.linalg.inv(H).dot(A1.T).dot(R).dot(tmp) >> return v >> >> w=np.array([[0],[0]]) >> for i in range(0,10): >> w=logreg_nr(w) >> >> In [5149]: w >> Out[5149]: >> array([[-21.12563996], >> [ 12.94750716]]) >> >> Notice how much faster Newton-Raphson goes than Gradient Descent. But >> they give the same result. >> >> How can I see which parameters LogisticRegression() found? And should I >> give LogisticRegression other parameters? >> >> On Sat, Jun 8, 2019 at 11:34 AM Eric J. Van der Velden < >> ericjvandervelden at gmail.com> wrote: >> >>> Hello, >>> >>> I am learning sklearn from my book of Geron. On page 137 he learns the >>> model of petal widths. >>> >>> When I implements logistic regression myself as I learned from my >>> Coursera course or from my book of Bishop I find that the following >>> parameters are found where the cost function is minimal: >>> >>> In [6219]: w >>> Out[6219]: >>> array([[-21.12563996], >>> [ 12.94750716]]) >>> >>> I used Gradient Descent and Newton-Raphson, both give the same answer. >>> >>> My question is: how can I see after fit() which parameters >>> LogisticRegression() has found? >>> >>> One other question also: when I read the documentation page, >>> https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression, >>> I see a different cost function as I read in the books. >>> >>> Thanks. >>> >>> >>> >>> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From t3kcit at gmail.com Tue Jun 11 14:47:57 2019 From: t3kcit at gmail.com (Andreas Mueller) Date: Tue, 11 Jun 2019 14:47:57 -0400 Subject: [scikit-learn] LogisticRegression In-Reply-To: References: Message-ID: <295e5a02-def7-6d85-4a17-0874e191328e@gmail.com> On 6/11/19 11:47 AM, Eric J. Van der Velden wrote: > Hi Nicolas, Andrew, > > Thanks! > > I found out that it is the regularization term. Sklearn always has > that term. When I program logistic regression with that term too, with > \lambda=1, I get exactly the same answer as sklearn, when I look at > the parameters you gave me. > > Question is why sklearn always has that term in logistic regression. > If you have enough data, do you need a regularization term? It's equivalent to setting C to a high value. 
We now allow penalty='none' in logisticregression, see https://github.com/scikit-learn/scikit-learn/pull/12860 I opened an issue on improving the docs: https://github.com/scikit-learn/scikit-learn/issues/14070 feel free to make suggestions there. There's more discussion here as well: https://github.com/scikit-learn/scikit-learn/issues/6738 > > Op di 11 jun. 2019 10:08 schreef Andrew Howe >: > > The coef_ attribute of the LogisticRegression object stores the > parameters. > > Andrew > > <~~~~~~~~~~~~~~~~~~~~~~~~~~~> > J. Andrew Howe, PhD > LinkedIn Profile > ResearchGate Profile > > Open Researcher and Contributor ID (ORCID) > > Github Profile > Personal Website > I live to learn, so I can learn to live. - me > <~~~~~~~~~~~~~~~~~~~~~~~~~~~> > > > On Sat, Jun 8, 2019 at 6:58 PM Eric J. Van der Velden > > > wrote: > > Here I have added what I had programmed. > > With sklearn's LogisticRegression(), how can I see the > parameters it has found after .fit() where the cost is > minimal? I use the book of Geron about scikit-learn and > tensorflow and on page 137 he trains the model of petal > widths. I did the following: > > ? ? iris=datasets.load_iris() > ? ? a1=iris['data'][:,3:] > ? ? y=(iris['target']==2).astype(int) > ? ? log_reg=LogisticRegression() > ? ? log_reg.fit(a1,y) > > ? ? log_reg.coef_ > ? ? array([[2.61727777]]) > ? ? log_reg.intercept_ > ? ? array([-4.2209364]) > > > I did the logistic regression myself with Gradient Descent or > Newton-Raphson as I learned from my Coursera course and > respectively from my book of Bishop. I used the Gradient > Descent method like so: > > ? ? from sklearn import datasets > ? ? iris=datasets.load_iris() > ? ? a1=iris['data'][:,3:] > ? ? A1=np.c_[np.ones((150,1)),a1] > y=(iris['target']==2).astype(int).reshape(-1,1) > ? ? lmda=1 > > ? ? from scipy.special import expit > > ? ? def logreg_gd(w): > ? ? ? z2=A1.dot(w) > ? ? ? a2=expit(z2) > ? ? ? delta2=a2-y > ? ? ? w=w-(lmda/len(a1))*A1.T.dot(delta2) > ? ? ? return w > ? ? w=np.array([[0],[0]]) > ? ? for i in range(0,100000): > ? ? ? w=logreg_gd(w) > > ? ? In [6219]: w > ? ? Out[6219]: > ? ? array([[-21.12563996], > ? ? ? ? ? ?[ 12.94750716]]) > > I used Newton-Raphson like so, see Bishop page 207, > > ? ? from sklearn import datasets > ? ? iris=datasets.load_iris() > ? ? a1=iris['data'][:,3:] > ? ? A1=np.c_[np.ones(len(a1)),a1] > y=(iris['target']==2).astype(int).reshape(-1,1) > ? ? def logreg_nr(w): > ? ? ? z1=A1.dot(w) > ? ? ? y=expit(z1) > ? ? ? R=np.diag((y*(1-y))[:,0]) > ? ? ? H=A1.T.dot(R).dot(A1) > ? ? ? tmp=A1.dot(w)-np.linalg.inv(R).dot(y-t) > v=np.linalg.inv(H).dot(A1.T).dot(R).dot(tmp) > ? ? ? return v > > ? ? w=np.array([[0],[0]]) > ? ? for i in range(0,10): > ? ? ? w=logreg_nr(w) > > ? ? In [5149]: w > ? ? Out[5149]: > ? ? array([[-21.12563996], > ? ? ? ? ? ?[ 12.94750716]]) > > Notice how much faster Newton-Raphson goes than Gradient > Descent. But they give the same result. > > How can I see which parameters LogisticRegression() found? And > should I give LogisticRegression other parameters? > > On Sat, Jun 8, 2019 at 11:34 AM Eric J. Van der Velden > > wrote: > > Hello, > > I am learning sklearn from my book of Geron. On page 137 > he learns the model of petal widths. > > When I implements logistic regression myself as I learned > from my Coursera course or from my book of Bishop I find > that the following parameters are found where the cost > function is minimal: > > In [6219]: w > Out[6219]: > array([[-21.12563996], > ? ? ? 
?[ 12.94750716]]) > > I used Gradient Descent and Newton-Raphson, both give the > same answer. > > My question is: how can I see after fit() which parameters > LogisticRegression() has found? > > One other question also: when I read the documentation > page, > https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression, > I see a different cost function as I read in the books. > > Thanks. > > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn -------------- next part -------------- An HTML attachment was scrubbed... URL: From ericjvandervelden at gmail.com Wed Jun 12 00:18:47 2019 From: ericjvandervelden at gmail.com (Eric J. Van der Velden) Date: Wed, 12 Jun 2019 06:18:47 +0200 Subject: [scikit-learn] LogisticRegression In-Reply-To: <295e5a02-def7-6d85-4a17-0874e191328e@gmail.com> References: <295e5a02-def7-6d85-4a17-0874e191328e@gmail.com> Message-ID: Thanks! Op di 11 jun. 2019 20:48 schreef Andreas Mueller : > > > On 6/11/19 11:47 AM, Eric J. Van der Velden wrote: > > Hi Nicolas, Andrew, > > Thanks! > > I found out that it is the regularization term. Sklearn always has that > term. When I program logistic regression with that term too, with > \lambda=1, I get exactly the same answer as sklearn, when I look at the > parameters you gave me. > > Question is why sklearn always has that term in logistic regression. If > you have enough data, do you need a regularization term? > > It's equivalent to setting C to a high value. > We now allow penalty='none' in logisticregression, see > https://github.com/scikit-learn/scikit-learn/pull/12860 > > I opened an issue on improving the docs: > https://github.com/scikit-learn/scikit-learn/issues/14070 > > feel free to make suggestions there. > > There's more discussion here as well: > https://github.com/scikit-learn/scikit-learn/issues/6738 > > > > Op di 11 jun. 2019 10:08 schreef Andrew Howe : > >> The coef_ attribute of the LogisticRegression object stores the >> parameters. >> >> Andrew >> >> <~~~~~~~~~~~~~~~~~~~~~~~~~~~> >> J. Andrew Howe, PhD >> LinkedIn Profile >> ResearchGate Profile >> Open Researcher and Contributor ID (ORCID) >> >> Github Profile >> Personal Website >> I live to learn, so I can learn to live. - me >> <~~~~~~~~~~~~~~~~~~~~~~~~~~~> >> >> >> On Sat, Jun 8, 2019 at 6:58 PM Eric J. Van der Velden < >> ericjvandervelden at gmail.com> wrote: >> >>> Here I have added what I had programmed. >>> >>> With sklearn's LogisticRegression(), how can I see the parameters it has >>> found after .fit() where the cost is minimal? I use the book of Geron about >>> scikit-learn and tensorflow and on page 137 he trains the model of petal >>> widths. I did the following: >>> >>> iris=datasets.load_iris() >>> a1=iris['data'][:,3:] >>> y=(iris['target']==2).astype(int) >>> log_reg=LogisticRegression() >>> log_reg.fit(a1,y) >>> >>> log_reg.coef_ >>> array([[2.61727777]]) >>> log_reg.intercept_ >>> array([-4.2209364]) >>> >>> >>> I did the logistic regression myself with Gradient Descent or >>> Newton-Raphson as I learned from my Coursera course and respectively from >>> my book of Bishop. 
I used the Gradient Descent method like so: >>> >>> from sklearn import datasets >>> iris=datasets.load_iris() >>> a1=iris['data'][:,3:] >>> A1=np.c_[np.ones((150,1)),a1] >>> y=(iris['target']==2).astype(int).reshape(-1,1) >>> lmda=1 >>> >>> from scipy.special import expit >>> >>> def logreg_gd(w): >>> z2=A1.dot(w) >>> a2=expit(z2) >>> delta2=a2-y >>> w=w-(lmda/len(a1))*A1.T.dot(delta2) >>> return w >>> >>> w=np.array([[0],[0]]) >>> for i in range(0,100000): >>> w=logreg_gd(w) >>> >>> In [6219]: w >>> Out[6219]: >>> array([[-21.12563996], >>> [ 12.94750716]]) >>> >>> I used Newton-Raphson like so, see Bishop page 207, >>> >>> from sklearn import datasets >>> iris=datasets.load_iris() >>> a1=iris['data'][:,3:] >>> A1=np.c_[np.ones(len(a1)),a1] >>> y=(iris['target']==2).astype(int).reshape(-1,1) >>> >>> def logreg_nr(w): >>> z1=A1.dot(w) >>> y=expit(z1) >>> R=np.diag((y*(1-y))[:,0]) >>> H=A1.T.dot(R).dot(A1) >>> tmp=A1.dot(w)-np.linalg.inv(R).dot(y-t) >>> v=np.linalg.inv(H).dot(A1.T).dot(R).dot(tmp) >>> return v >>> >>> w=np.array([[0],[0]]) >>> for i in range(0,10): >>> w=logreg_nr(w) >>> >>> In [5149]: w >>> Out[5149]: >>> array([[-21.12563996], >>> [ 12.94750716]]) >>> >>> Notice how much faster Newton-Raphson goes than Gradient Descent. But >>> they give the same result. >>> >>> How can I see which parameters LogisticRegression() found? And should >>> I give LogisticRegression other parameters? >>> >>> On Sat, Jun 8, 2019 at 11:34 AM Eric J. Van der Velden < >>> ericjvandervelden at gmail.com> wrote: >>> >>>> Hello, >>>> >>>> I am learning sklearn from my book of Geron. On page 137 he learns the >>>> model of petal widths. >>>> >>>> When I implements logistic regression myself as I learned from my >>>> Coursera course or from my book of Bishop I find that the following >>>> parameters are found where the cost function is minimal: >>>> >>>> In [6219]: w >>>> Out[6219]: >>>> array([[-21.12563996], >>>> [ 12.94750716]]) >>>> >>>> I used Gradient Descent and Newton-Raphson, both give the same answer. >>>> >>>> My question is: how can I see after fit() which parameters >>>> LogisticRegression() has found? >>>> >>>> One other question also: when I read the documentation page, >>>> https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression, >>>> I see a different cost function as I read in the books. >>>> >>>> Thanks. >>>> >>>> >>>> >>>> _______________________________________________ >>> scikit-learn mailing list >>> scikit-learn at python.org >>> https://mail.python.org/mailman/listinfo/scikit-learn >>> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> > > _______________________________________________ > scikit-learn mailing listscikit-learn at python.orghttps://mail.python.org/mailman/listinfo/scikit-learn > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From tmrsg11 at gmail.com Wed Jun 12 14:36:42 2019 From: tmrsg11 at gmail.com (C W) Date: Wed, 12 Jun 2019 14:36:42 -0400 Subject: [scikit-learn] How is linear regression in scikit-learn done? Do you need train and test split? In-Reply-To: References: <23948dee-bbaa-df7e-0b13-559a40f7a372@gmail.com> Message-ID: Thank you both for the papers references. @ Andreas, What is your take? And what are you implying? 
The Breiman (2001) paper points out the black box vs. statistical approach. I call them black box vs. open box. He advocates black box in the paper. Black box: y <--- nature <--- x Open box: y <--- linear regression <---- x Decision trees and neural nets are black box model. They require large amount of data to train, and skip the part where it tries to understand nature. Because it is a black box, you can't open up to see what's inside. Linear regression is a very simple model that you can use to approximate nature, but the key thing is that you need to know how the data are generated. @ Brown, I know nothing about molecular modeling. The paper your linked "Beware of q2!" paper raises some interesting point, as far as I see in sklearn linear regression, score is R^2. On Wed, Jun 5, 2019 at 9:11 AM Andreas Mueller wrote: > > On 6/4/19 8:44 PM, C W wrote: > > Thank you all for the replies. > > > > I agree that prediction accuracy is great for evaluating black-box ML > > models. Especially advanced models like neural networks, or > > not-so-black models like LASSO, because they are NP-hard to solve. > > > > Linear regression is not a black-box. I view prediction accuracy as an > > overkill on interpretable models. Especially when you can use > > R-squared, coefficient significance, etc. > > > > Prediction accuracy also does not tell you which feature is important. > > > > What do you guys think? Thank you! > > > Did you read the paper that I sent? ;) > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From t3kcit at gmail.com Thu Jun 13 10:41:39 2019 From: t3kcit at gmail.com (Andreas Mueller) Date: Thu, 13 Jun 2019 10:41:39 -0400 Subject: [scikit-learn] How is linear regression in scikit-learn done? Do you need train and test split? In-Reply-To: References: <23948dee-bbaa-df7e-0b13-559a40f7a372@gmail.com> Message-ID: <5ca8e766-a243-fddb-a366-a6cb1a95bd2b@gmail.com> He doesn't only talk about black box vs statistical, he talks about model based vs prediction based. He says that if you validate predictions, you don't need to (necessarily) worry about model misspecification. A linear regression model can be misspecified, and it can be overfit. Just fitting the model will not inform you whether either of these is the case. Because the model is simple and well understood, there is ways to check model misspecification and overfit in several ways. A train-test-split doesn't exactly tell you whether the model is misspecified (errors could be non-normal and prediction could still be good), but it gives you an idea if the model is "useful". Basically: you need to validate whatever you did. There are model-based approaches and there are prediction based approaches. Prediction based approaches are always applicable, model-based approaches are usually more limited and harder to do (but if you find a good model you got a model of the process, which is great!). But you need to pick at least one of the two approaches. On 6/12/19 2:36 PM, C W wrote: > Thank you both for the papers references. > > @ Andreas, > What is your take? And what are you implying? > > The Breiman (2001) paper points out the black box vs. statistical > approach. I call them black box vs. open box. He advocates black box > in the paper. 
> Black box: > y <--- nature <--- x > > Open box: > y <--- linear regression <---- x > > Decision trees and neural nets are black box model. They require large > amount of data to train, and skip the part where it tries to > understand nature. > > Because it is a black box, you can't open up to see what's inside. > Linear regression is a very simple model that you can use to > approximate nature, but the key thing is that you need to know how the > data are generated. > > @ Brown, > I know nothing about molecular modeling. The paper your linked "Beware > of q2!" paper raises some interesting point, as far as I see in > sklearn linear regression, score is R^2. > > On Wed, Jun 5, 2019 at 9:11 AM Andreas Mueller > wrote: > > > On 6/4/19 8:44 PM, C W wrote: > > Thank you all for the replies. > > > > I agree that prediction accuracy is great for evaluating > black-box ML > > models. Especially advanced models like neural networks, or > > not-so-black models like LASSO, because they are NP-hard to solve. > > > > Linear regression is not a black-box. I view prediction accuracy > as an > > overkill on interpretable models. Especially when you can use > > R-squared, coefficient significance, etc. > > > > Prediction accuracy also does not tell you which feature is > important. > > > > What do you guys think? Thank you! > > > Did you read the paper that I sent? ;) > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn -------------- next part -------------- An HTML attachment was scrubbed... URL: From np.dong572 at gmail.com Thu Jun 13 11:03:48 2019 From: np.dong572 at gmail.com (Naiping Dong) Date: Thu, 13 Jun 2019 23:03:48 +0800 Subject: [scikit-learn] Concatenate posterior probabilities of different datasets obtained from different models Message-ID: Hi all, I have several small datasets, each is composed by two classes. The posterior probabilities of different datasets are predicted by different models, which are constructed either by different models having the attribute "predict_proba" or the same algorithm trained by different training data. I wonder whether there exists a method to concatenate these probabilities as a single array so that I can do some inferences from much larger number of probabilities. Thanks in advance. Best regards, Elkan -------------- next part -------------- An HTML attachment was scrubbed... URL: From wendley at ufc.br Mon Jun 17 09:27:27 2019 From: wendley at ufc.br (Wendley Silva) Date: Mon, 17 Jun 2019 10:27:27 -0300 Subject: [scikit-learn] How use get_depth Message-ID: Hi all, I tried several ways to use the get_depth() method from DecisionTreeRegression, but I always get the same error: self.clf.*get_depth()* AttributeError: *'DecisionTreeRegressor' object has no attribute 'get_depth'* I researched the internet and found no solution. Any idea how to use it correctly? *Description of get_depth():* https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html Thanks in advance. Best, *Wendley S. Silva* Universidade Federal do Cear? - Brasil +55 (88) 3695.4608 wendley at ufc.br www.ec.ufc.br/wendley Rua Cel. Estanislau Frota, 563, Centro, Sobral-CE, Brasil - CEP 62.010-560 -------------- next part -------------- An HTML attachment was scrubbed... 
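For anyone hitting the same AttributeError, a small illustrative sketch (synthetic data, not from the original message) covering both situations that come up in the replies below: get_depth() only exists from scikit-learn 0.21 onwards, while the depth of a fitted tree has long been available as tree_.max_depth.

import sklearn
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=200, n_features=4, random_state=0)
reg = DecisionTreeRegressor(random_state=0).fit(X, y)

print(sklearn.__version__)
print(reg.tree_.max_depth)       # works on any version once fit() has been called

if hasattr(reg, "get_depth"):    # get_depth() was added in scikit-learn 0.21
    print(reg.get_depth())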
URL: From jbbrown at kuhp.kyoto-u.ac.jp Mon Jun 17 09:41:08 2019 From: jbbrown at kuhp.kyoto-u.ac.jp (Brown J.B.) Date: Mon, 17 Jun 2019 22:41:08 +0900 Subject: [scikit-learn] How use get_depth In-Reply-To: References: Message-ID: Perhaps you mean: DecisionTreeRegressor.tree_.max_depth , where DecisionTreeRegressor.tree_ is available after calling fit() ? 2019?6?17?(?) 22:29 Wendley Silva : > Hi all, > > I tried several ways to use the get_depth() method from > DecisionTreeRegression, but I always get the same error: > > self.clf.*get_depth()* > AttributeError: *'DecisionTreeRegressor' object has no attribute > 'get_depth'* > > I researched the internet and found no solution. Any idea how to use it > correctly? > > *Description of get_depth():* > > https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html > > Thanks in advance. > > Best, > *Wendley S. Silva* > Universidade Federal do Cear? - Brasil > > +55 (88) 3695.4608 > wendley at ufc.br > www.ec.ufc.br/wendley > Rua Cel. Estanislau Frota, 563, Centro, Sobral-CE, Brasil - CEP 62.0 > 10-560 > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From adrin.jalali at gmail.com Mon Jun 17 09:43:35 2019 From: adrin.jalali at gmail.com (Adrin) Date: Mon, 17 Jun 2019 15:43:35 +0200 Subject: [scikit-learn] How use get_depth In-Reply-To: References: Message-ID: The function is added in the latest release, you probably need to update the package and then you should have it. On Mon., Jun. 17, 2019, 15:42 Brown J.B. via scikit-learn, < scikit-learn at python.org> wrote: > Perhaps you mean: > DecisionTreeRegressor.tree_.max_depth , where DecisionTreeRegressor.tree_ > is available after calling fit() ? > > > 2019?6?17?(?) 22:29 Wendley Silva : > >> Hi all, >> >> I tried several ways to use the get_depth() method from >> DecisionTreeRegression, but I always get the same error: >> >> self.clf.*get_depth()* >> AttributeError: *'DecisionTreeRegressor' object has no attribute >> 'get_depth'* >> >> I researched the internet and found no solution. Any idea how to use it >> correctly? >> >> *Description of get_depth():* >> >> https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html >> >> Thanks in advance. >> >> Best, >> *Wendley S. Silva* >> Universidade Federal do Cear? - Brasil >> >> +55 (88) 3695.4608 >> wendley at ufc.br >> www.ec.ufc.br/wendley >> Rua Cel. Estanislau Frota, 563, Centro, Sobral-CE, Brasil - CEP 62.0 >> 10-560 >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From jonatania at gmail.com Tue Jun 18 08:43:00 2019 From: jonatania at gmail.com (=?utf-8?Q?Jonatan_G=C3=B8ttcke?=) Date: Tue, 18 Jun 2019 14:43:00 +0200 Subject: [scikit-learn] Semi-supervised learning contributions Message-ID: <5d08dc53.1c69fb81.de8ab.0d06@mx.google.com> Hi Scikit-Learn developers. I?m a masters student in Data Mining and Machine Learning from Denmark. I?m finishing my masters in two months time. 
As a part of my thesis I?m comparing a theoretically sound and interesting semi-supervised algorithm I came up with, to a bunch of other algorithms. The problem was, that all of these basic graph based algorithms didn?t exist in scikit-learn, so I?ve implemented them following a scikit-learn like ?methodology?, but it?s not 100% compatible. Compatibility will require a bit more work, but dependant on how much it is (I don?t know cause I?ve never contributed to scikit-learn before), I might have time to put that in before the deadline (or I could do it as a part of my PhD, and get that as an easy first publication). I?m sure you are more interested in getting the basic graph algorithms in there, than my own interesting (and still unpublished) methods ?. I would however love to hear if you guys are interested in getting this contribution into scikit-learn.? Cheers Jonatan M. G?ttcke CEO @ OpGo +45 23 65 01 96 -------------- next part -------------- An HTML attachment was scrubbed... URL: From gael.varoquaux at normalesup.org Tue Jun 18 10:08:35 2019 From: gael.varoquaux at normalesup.org (Gael Varoquaux) Date: Tue, 18 Jun 2019 10:08:35 -0400 Subject: [scikit-learn] Semi-supervised learning contributions In-Reply-To: <5d08dc53.1c69fb81.de8ab.0d06@mx.google.com> References: <5d08dc53.1c69fb81.de8ab.0d06@mx.google.com> Message-ID: <20190618140835.h3uschq3uvh7b35c@phare.normalesup.org> Hi Jonathan, This is very interesting. However, the bar in terms of quality and scope of scikit-learn is very high. The best way to move forward is to build a package outside of scikit-learn, possibly in scikit-learn-contrib, and maybe in the longer run, consider contributing some methods to scikit-learn. In terms of what algorithms and method can go in scikit-learn, the criteria are written here: https://scikit-learn.org/dev/faq.html#what-are-the-inclusion-criteria-for-new-algorithms Keep in mind that new methods cannot go in: scikit-learn is only for very established work. Cheers, Ga?l On Tue, Jun 18, 2019 at 02:43:00PM +0200, Jonatan G?ttcke wrote: > Hi Scikit-Learn developers. > I?m a masters student in Data Mining and Machine Learning from Denmark. I?m > finishing my masters in two months time. > As a part of my thesis I?m comparing a theoretically sound and interesting > semi-supervised algorithm I came up with, to a bunch of other algorithms. The > problem was, that all of these basic graph based algorithms didn?t exist in > scikit-learn, so I?ve implemented them following a scikit-learn like > ?methodology?, but it?s not 100% compatible. Compatibility will require a bit > more work, but dependant on how much it is (I don?t know cause I?ve never > contributed to scikit-learn before), I might have time to put that in before > the deadline (or I could do it as a part of my PhD, and get that as an easy > first publication). I?m sure you are more interested in getting the basic graph > algorithms in there, than my own interesting (and still unpublished) methods ?? > . > I would however love to hear if you guys are interested in getting this > contribution into scikit-learn.? > Cheers > Jonatan M. 
G?ttcke > CEO @ OpGo > +45 23 65 01 96 > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn -- Gael Varoquaux Senior Researcher, INRIA http://gael-varoquaux.info http://twitter.com/GaelVaroquaux From gael.varoquaux at normalesup.org Tue Jun 18 13:16:06 2019 From: gael.varoquaux at normalesup.org (gael.varoquaux at normalesup.org) Date: Tue, 18 Jun 2019 13:16:06 -0400 Subject: [scikit-learn] Semi-supervised methods In-Reply-To: <201906181649.x5IGn0YE020733@nef.ens.fr> References: <201906181649.x5IGn0YE020733@nef.ens.fr> Message-ID: <20190618171606.w4yq6hy3lcm3cgnv@phare.normalesup.org> Hi Jonathan, It's very important that you keep discussions on the list, to keep everybody informed, and also to make sure that I am not the bottleneck (I deal terribly with email). Once again, I want to stress that getting code in scikit-learn is a long process (maybe unfortunately). I think that working on a package getting these algorithm out first, before trying to move some upstream to scikit-learn, is the best option. I am saying this despite the fact that I really want scikit-learn to grow and consolidate useful algorithms in one package. It's just a question of being efficient. Cheers, Ga?l On Tue, Jun 18, 2019 at 06:31:13PM +0200, Jonatan G?ttcke wrote: > I?ve been reading the sites on scikit-learn now, and my methods actually follow > the methodology of .fit and .predict and all of the graph-methods implemented, > are the very fundamental and established graph approaches for semi-supervised > learning as described by Zhu & Gholdberg in their ?Introduction to > semi-supervised learning?. > Even though the methods fit the bill very well, do you think I should push it > to scikit-learn contrib? ?And is there a graph algorithm Expert in the Group, > or a semi-supervised maintainer or something, that I can discuss my > implemenations with ? > Thanks for getting back so quickly btw. > Cheers > Jonatan M. G?ttcke > CEO @ OpGo > +45 23 65 01 96 -- Gael Varoquaux Senior Researcher, INRIA http://gael-varoquaux.info http://twitter.com/GaelVaroquaux From m.caorsi at l2f.ch Wed Jun 19 05:56:29 2019 From: m.caorsi at l2f.ch (Matteo Caorsi) Date: Wed, 19 Jun 2019 09:56:29 +0000 Subject: [scikit-learn] Full-time job opportunity -- software engineer for open source project Message-ID: Who we are: L2F is a start-up based on the EPFL Innovation Park (Lausanne, CH). We are currently working at the frontier of machine learning and topological data analysis, in collaboration with several academic partners. Our Mission: We are developing an open source library implementing new topological data analysis algorithms that are being designed by our research team. The library shall be user-friendly, well documented, high-performance and well integrated with state-of-the-art machine learning libraries (such as scikit-learn and Keras). We are offering a full-time job in our company to help us develop this library. The candidate will work in the L2F research team. Profile description: We are looking for a computer scientist matching these characteristics: * 2+ years of experience in software engineering * Skilled with Python and C++ (in particular, at ease wrapping C++ code for Python) * Aware of how open source communities work. Better if he/she contributed in open-source collaborations, such as scikit-learn. 
* At ease writing specifications, developer documentation and good user documentation * Fluent with continuous integration, Git and common developer tools * Skilled in testing architectures (unit tests, integration tests,?) How to apply: Applicants can write an e-mail to Dr. Matteo Caorsi (m.caorsi at l2f.ch ) attaching their CV and a short letter detailing their relevant experience and motivation. Starting date: This position is available for immediate start for the right candidate. -------------- next part -------------- An HTML attachment was scrubbed... URL: From reismc at ime.eb.br Wed Jun 19 16:36:39 2019 From: reismc at ime.eb.br (Mauricio Reis) Date: Wed, 19 Jun 2019 17:36:39 -0300 Subject: [scikit-learn] Scikit Learn in a Cray computer Message-ID: <7fb316393ac7101e1a1809c7fdb4daa6@ime.eb.br> I'd like to understand how parallelism works in the DBScan routine in SciKit Learn running on the Cray computer and what should I do to improve the results I'm looking at. I have adapted the existing example in [https://scikit-learn.org/stable/auto_examples/cluster/plot_dbscan.html#sphx-glr-auto-examples-cluster-plot-dbscan-py] to run with 100,000 points and thus enable one processing time allowing reasonable evaluation of times obtained. I changed the parameter "n_jobs = x", "x" ranging from 1 to 6. I repeated several times the same experiments and calculated the average values ??of the processing time. n_jobs time 1 21,3 2 15,1 3 14,8 4 15,2 5 15,5 6 15,0 I then get the times that appear in the table above and in the attached image. As can be seen, there was only effective gain when "n_jobs = 2" and no difference for larger quantities. And yet, the gain was only less than 30%!! Why were the gains so small? Why was there no greater gain for a greater value of the "n_jobs" parameter? Is it possible to improve the results I have obtained? -- Ats., Mauricio Reis -------------- next part -------------- A non-text attachment was scrubbed... Name: Time_X_CPUs (Cray).jpg Type: image/jpeg Size: 23348 bytes Desc: not available URL: From olivier.grisel at ensta.org Wed Jun 19 16:44:20 2019 From: olivier.grisel at ensta.org (Olivier Grisel) Date: Wed, 19 Jun 2019 21:44:20 +0100 Subject: [scikit-learn] Scikit Learn in a Cray computer In-Reply-To: <7fb316393ac7101e1a1809c7fdb4daa6@ime.eb.br> References: <7fb316393ac7101e1a1809c7fdb4daa6@ime.eb.br> Message-ID: How many cores du you have on this machine? joblib.cpu_count() -------------- next part -------------- An HTML attachment was scrubbed... URL: From reismc at ime.eb.br Wed Jun 19 19:14:41 2019 From: reismc at ime.eb.br (Mauricio Reis) Date: Wed, 19 Jun 2019 20:14:41 -0300 Subject: [scikit-learn] Scikit Learn in a Cray computer In-Reply-To: References: <7fb316393ac7101e1a1809c7fdb4daa6@ime.eb.br> Message-ID: <90d307d1f99cd87e6ac78c365e8f2e51@ime.eb.br> I can not access the Cray computer at this moment to run the suggested code. Once you have access, I'll let you know. But documentation (provided by a teacher in charge of the Cray computer) shows: - 10 blades - 4 nodes per blade = 40 nodes - each node: 1 CPU, 1 GPU, 32 GBytes --- Ats., Mauricio Reis Em 19/06/2019 17:44, Olivier Grisel escreveu: > How many cores du you have on this machine? > > joblib.cpu_count() > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn From jbbrown at kuhp.kyoto-u.ac.jp Wed Jun 19 20:04:54 2019 From: jbbrown at kuhp.kyoto-u.ac.jp (Brown J.B.) 
Date: Thu, 20 Jun 2019 09:04:54 +0900 Subject: [scikit-learn] Scikit Learn in a Cray computer In-Reply-To: <90d307d1f99cd87e6ac78c365e8f2e51@ime.eb.br> References: <7fb316393ac7101e1a1809c7fdb4daa6@ime.eb.br> <90d307d1f99cd87e6ac78c365e8f2e51@ime.eb.br> Message-ID: 2019?6?20?(?) 8:16 Mauricio Reis : > But documentation (provided by a teacher in charge of the Cray computer) > shows: > - each node: 1 CPU, 1 GPU, 32 GBytes > If that's true, then it appears to me that running on any individual compute host (node) has 1-core / 2-threads, and that would be why you wouldn't get any more performance after n_jobs=2. For n_jobs=3/4/..., you're just asking the same amount of compute hardware to do the same calculations. As instructed, you'll need to execute joblib.cpu_count() to resolve your host environment. -------------- next part -------------- An HTML attachment was scrubbed... URL: From pahome.chen at mirlab.org Thu Jun 20 10:33:17 2019 From: pahome.chen at mirlab.org (lampahome) Date: Thu, 20 Jun 2019 22:33:17 +0800 Subject: [scikit-learn] Is there any general way to make clustering huge time-series dataset better? Message-ID: I have a huge time-series dataset and should load batch by batch. My procedures like below: Scale to (0~1) Shuffle (because I use Birch not MiniBatchKMeans) Train Birch model with partial_fit Evaluate with silhouette_score (large is better) Why I use Birch is because it have partial_fit and no need to specify the cluster number But...I found evaluting by silhouette_score and db score, it will cluster with fewer cluster numbers. When I look into the data, it should cluster more than the clustering results. Should I change the evaluating way? or else? thx -------------- next part -------------- An HTML attachment was scrubbed... URL: From solegalli1 at gmail.com Mon Jun 24 04:01:39 2019 From: solegalli1 at gmail.com (Sole Galli) Date: Mon, 24 Jun 2019 09:01:39 +0100 Subject: [scikit-learn] titanic dataset, use for book Message-ID: Hello Scikit-learn team, I am currently writing a book for Packt on feature engineering, where I plan to show how to use the newest sklearn transformers. Could I confirm with you whether I can use the titanic dataset located here: titanic_url = ('https://raw.githubusercontent.com/amueller/' 'scipy-2017-sklearn/091d371/notebooks/datasets/titanic3.csv') in the book? The code where I found it, seems to have a BSD license, but I am not sure whether the license extends to the use of the dataset as well. https://scikit-learn.org/stable/auto_examples/compose/plot_column_transformer_mixed_types.html#sphx-glr-auto-examples-compose-plot-column-transformer-mixed-types-py Many thanks and I look forward to hearing from you Kind regards Sole -------------- next part -------------- An HTML attachment was scrubbed... URL: From t3kcit at gmail.com Tue Jun 25 11:04:11 2019 From: t3kcit at gmail.com (Andreas Mueller) Date: Tue, 25 Jun 2019 11:04:11 -0400 Subject: [scikit-learn] titanic dataset, use for book In-Reply-To: References: Message-ID: Hi Sole. I would suggest not to use this version of the titanic dataset. It's a personal repository of mine and might not exist forever. Ideally you (and we) would use fetch_openml. However, the current version doesn't have support for returning dataframes. That's addressed in https://github.com/scikit-learn/scikit-learn/pull/13902 which is not merged yet. By the time your book comes out, it's likely to be merged, but might not be released, depending on your timeline. 
It might be easier for your to upload the CSV file to a repository you control yourself. Best, Andy On 6/24/19 4:01 AM, Sole Galli wrote: > Hello Scikit-learn team, > > I am currently writing a book for Packt on feature engineering, where > I plan to show how to use the newest sklearn transformers. > > Could I confirm with you whether I can use the titanic dataset located > here: > titanic_url = ('https://raw.githubusercontent.com/amueller/' > 'scipy-2017-sklearn/091d371/notebooks/datasets/titanic3.csv') > > in the book? > > The code where I found it, seems to have a BSD license, but I am not > sure whether the license extends to the use of the dataset as well. > https://scikit-learn.org/stable/auto_examples/compose/plot_column_transformer_mixed_types.html#sphx-glr-auto-examples-compose-plot-column-transformer-mixed-types-py > > > ?Many thanks and I look forward?to hearing from you > > Kind regards > > Sole > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn -------------- next part -------------- An HTML attachment was scrubbed... URL: From pahome.chen at mirlab.org Wed Jun 26 06:32:33 2019 From: pahome.chen at mirlab.org (lampahome) Date: Wed, 26 Jun 2019 18:32:33 +0800 Subject: [scikit-learn] Any way to pre-calculate number of cluster roughly? Message-ID: I see many ways like elbow method, silhouette score, they all define the cluster number after clustering. Especially the elbow method, I need to monitor the relation with cluster number and find the elbow. But if the dataset is too huge to let me find the elbow and I don't even how many cluster number actually. Any way to pre-calculate number of cluster roughly? thx -------------- next part -------------- An HTML attachment was scrubbed... URL: From jamie.bull at oco-carbon.com Wed Jun 26 10:59:58 2019 From: jamie.bull at oco-carbon.com (Jamie Bull) Date: Wed, 26 Jun 2019 16:59:58 +0200 Subject: [scikit-learn] Any way to pre-calculate number of cluster roughly? In-Reply-To: References: Message-ID: A common rule of thumb is number of clusters = sqrt(number of items/2) http://www.ijarcsms.com/docs/paper/volume1/issue6/V1I6-0015.pdf On Wed, 26 Jun 2019 at 12:32, lampahome wrote: > I see many ways like elbow method, silhouette score, they all define the > cluster number after clustering. > > Especially the elbow method, I need to monitor the relation with cluster > number and find the elbow. > > But if the dataset is too huge to let me find the elbow and I don't even > how many cluster number actually. > > Any way to pre-calculate number of cluster roughly? > > thx > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From pahome.chen at mirlab.org Wed Jun 26 21:32:58 2019 From: pahome.chen at mirlab.org (lampahome) Date: Thu, 27 Jun 2019 09:32:58 +0800 Subject: [scikit-learn] Any way to pre-calculate number of cluster roughly? In-Reply-To: References: Message-ID: Jamie Bull ? 2019?6?26? ?? ??11:02??? > A common rule of thumb is number of clusters = sqrt(number of items/2) > http://www.ijarcsms.com/docs/paper/volume1/issue6/V1I6-0015.pdf > >> >> If I found it the number is too much, how to merge those groups? Calculate each silhouette score of groups or else? -------------- next part -------------- An HTML attachment was scrubbed... 
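To make the rule of thumb and the silhouette comparison above concrete, a rough sketch follows; the blob data, the candidate grid and the subsample size are invented purely for illustration and are not part of the thread. The heuristic only gives a starting point, and scanning a few values of k around it with a subsampled silhouette score is one cheap way to settle on fewer (or more) clusters.

import numpy as np
from sklearn.cluster import MiniBatchKMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in data
X, _ = make_blobs(n_samples=2000, centers=6, random_state=0)

# Rule-of-thumb starting point: k ~ sqrt(n_samples / 2)
k0 = int(np.sqrt(len(X) / 2.0))

# Scan a handful of candidates up to the heuristic; silhouette_score is subsampled
# so the evaluation stays cheap even on large data sets.
best_k, best_score = None, -1.0
for k in range(2, k0 + 1, max(1, k0 // 10)):
    labels = MiniBatchKMeans(n_clusters=k, random_state=0).fit_predict(X)
    score = silhouette_score(X, labels, sample_size=1000, random_state=0)
    if score > best_score:
        best_k, best_score = k, score

print(k0, best_k, best_score)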
URL: From pahome.chen at mirlab.org Thu Jun 27 06:40:09 2019 From: pahome.chen at mirlab.org (lampahome) Date: Thu, 27 Jun 2019 18:40:09 +0800 Subject: [scikit-learn] Any drawbacks when using partial_fit? Message-ID: I try to use Birch to cluster time-series data incrementally. Because insufficient memory, so I train it batch by batch. Every batch is 1000 samples and for 50 batch. I found when I only train the first batch, it cluster well. After first trained, I train following batch with the same model and use partial_fit to train them. I found the clustering result become worse after I trained many rounds until finish. Some samples will mix into another cluster which that seems very different with another samples in the same cluster. Is there any way to make it better? Or I use the wrong method? -------------- next part -------------- An HTML attachment was scrubbed... URL: From rth.yurchak at pm.me Thu Jun 27 07:07:27 2019 From: rth.yurchak at pm.me (Roman Yurchak) Date: Thu, 27 Jun 2019 11:07:27 +0000 Subject: [scikit-learn] titanic dataset, use for book In-Reply-To: References: Message-ID: <3vAXPRorApPTLfmswEU7NqPtsUAi82W7ehW9IEjVSdvqwxictDFDYIPLtPeAg8oSBJ9NIaAbDFhXi-5AzaXEJU1C7ix_Kzm9MzLmHle01xw=@pm.me> Meanwhile, loading the CSV from OpenML (https://www.openml.org/d/40945) would also work, pd.read_csv('https://www.openml.org/data/get_csv/16826755/phpMYEkMl') -- Roman On 25/06/2019 17:04, Andreas Mueller wrote: > By the time your book comes out, it's likely to be merged, but might not > be released, depending on your timeline. > It might be easier for your to upload the CSV file to a repository you > control yourself. From reismc at ime.eb.br Thu Jun 27 19:56:01 2019 From: reismc at ime.eb.br (Mauricio Reis) Date: Thu, 27 Jun 2019 20:56:01 -0300 Subject: [scikit-learn] Scikit Learn in a Cray computer In-Reply-To: References: <7fb316393ac7101e1a1809c7fdb4daa6@ime.eb.br> Message-ID: <676dd88da2829a7589629a735a2191d9@ime.eb.br> Finally I was able to access the Cray computer and run the routine. I am sending below the files and commands I used and the result found, where you can see "ncpus = 1" (I still do not know why 4 lines were printed - I only know that this amount depends on the value of the "aprun" command used in the file "ncpus.pbs"). But I do not know if you know the Cray computer environment and you'll understand what I did! I use Cray XK7 computer which has 10 blades, each blade has 4 nodes (total of 40 nodes) and each node has 1 CPU and 1 GPU! --- Ats., Mauricio Reis ---------------------------------------------------------------------------------------------- === p.sh === #!/bin/bash /usr/local/python_3.7/bin/python3.7 $1 === ncpus.py === from sklearn.externals import joblib import sklearn print('The scikit-learn version is {}.'.format(sklearn.__version__)) ncpus = joblib.cpu_count() print("--- ncpus =", ncpus) === ncpus.pbs === #!/bin/bash #PBS -l select=1:ncpus=8:mpiprocs=8 #PBS -j oe #PBS -l walltime=00:00:10 date echo "[$PBS_O_WORKDIR]" cd $PBS_O_WORKDIR aprun -n 4 p.sh ./ncpus.py === command === qsub ncpus.pbs === output === Thu Jun 27 05:22:35 BRT 2019 [/home/reismc] The scikit-learn version is 0.20.3. The scikit-learn version is 0.20.3. The scikit-learn version is 0.20.3. The scikit-learn version is 0.20.3. 
--- ncpus = 1 --- ncpus = 1 --- ncpus = 1 --- ncpus = 1 Application 32826 resources: utime ~8s, stime ~1s, Rss ~43168, inblocks ~102981, outblocks ~0 ---------------------------------------------------------------------------------------------- Em 19/06/2019 17:44, Olivier Grisel escreveu: > How many cores du you have on this machine? > > joblib.cpu_count() > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn From roxana.danger at gmail.com Fri Jun 28 01:27:01 2019 From: roxana.danger at gmail.com (Roxana Danger) Date: Fri, 28 Jun 2019 06:27:01 +0100 Subject: [scikit-learn] baggingClassifier with pipeline Message-ID: Hello, I would like to use the BaggingClassifier whose base estimator is a pipeline with multiple transformations including a DataFrameMapper from sklearn_pandas. I am getting an error during the fitting the DataFrameMapper as the first step of the BaggingClassifier is to convert the DataFrame to an array (see in BaseBagging._fit method). Similar problem happen using directly sklearn.Pipeline instead of the DataFrameMapper. in both cases, a DataFrame is expected as input, but, instead, an array is provided to the Pipeline. Is there anyway I can overcome this problem? Many thanks, Roxana -------------- next part -------------- An HTML attachment was scrubbed... URL: From jbbrown at kuhp.kyoto-u.ac.jp Fri Jun 28 05:29:42 2019 From: jbbrown at kuhp.kyoto-u.ac.jp (Brown J.B.) Date: Fri, 28 Jun 2019 18:29:42 +0900 Subject: [scikit-learn] Scikit Learn in a Cray computer In-Reply-To: <676dd88da2829a7589629a735a2191d9@ime.eb.br> References: <7fb316393ac7101e1a1809c7fdb4daa6@ime.eb.br> <676dd88da2829a7589629a735a2191d9@ime.eb.br> Message-ID: > > where you can see "ncpus = 1" (I still do not know why 4 lines were > printed - > > (total of 40 nodes) and each node has 1 CPU and 1 GPU! > > #PBS -l select=1:ncpus=8:mpiprocs=8 > aprun -n 4 p.sh ./ncpus.py > You can request 8 CPUs from a job scheduler, but if each node the script runs on contains only one virtual/physical core, then cpu_count() will return 1. If that CPU supports multi-threading, you would typically get 2. For example, on my workstation: `--> egrep "processor|model name|core id" /proc/cpuinfo processor : 0 model name : Intel(R) Core(TM) i3-4160 CPU @ 3.60GHz core id : 0 processor : 1 model name : Intel(R) Core(TM) i3-4160 CPU @ 3.60GHz core id : 1 processor : 2 model name : Intel(R) Core(TM) i3-4160 CPU @ 3.60GHz core id : 0 processor : 3 model name : Intel(R) Core(TM) i3-4160 CPU @ 3.60GHz core id : 1 `--> python3 -c "from sklearn.externals import joblib; print(joblib.cpu_count())" 4 It seems that in this situation, if you're wanting to parallelize *independent* sklearn calculations (e.g., changing dataset or random seed), you'll ask for the MPI by PBS processes like you have, but you'll need to place the sklearn computations in a function and then take care of distributing that function call across the MPI processes. Then again, if the runs are independent, it's a lot easier to write a for loop in a shell script that changes the dataset/seed and submits it to the job scheduler to let the job handler take care of the parallel distribution. (I do this when performing 10+ independent runs of sklearn modeling, where models use multiple threads during calculations; in my case, SLURM then takes care of finding the available nodes to distribute the work to.) Hope this helps. J.B. 
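To make the "one independent run per scheduler job" pattern concrete, a minimal per-job script is sketched below; the file name, the SEED variable and the synthetic data are placeholders rather than anything from the original posts. The scheduler (PBS/SLURM), not n_jobs, then supplies the parallelism by running many such jobs at once.

# run_dbscan.py -- hypothetical per-job script; submit one scheduler job per seed,
# e.g. with a small shell loop around qsub/sbatch that sets SEED differently each time.
import os
import sys
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

seed = int(os.environ.get("SEED", sys.argv[1] if len(sys.argv) > 1 else 0))

# Each job generates (or loads) its own data and runs an ordinary single-node DBSCAN.
X, _ = make_blobs(n_samples=100000, centers=5, cluster_std=0.5, random_state=seed)
labels = DBSCAN(eps=0.3, min_samples=10).fit_predict(X)

n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
np.save("labels_seed%d.npy" % seed, labels)
print("seed=%d found %d clusters" % (seed, n_clusters))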
-------------- next part -------------- An HTML attachment was scrubbed... URL: From solegalli1 at gmail.com Fri Jun 28 08:31:35 2019 From: solegalli1 at gmail.com (Sole Galli) Date: Fri, 28 Jun 2019 13:31:35 +0100 Subject: [scikit-learn] titanic dataset, use for book In-Reply-To: <3vAXPRorApPTLfmswEU7NqPtsUAi82W7ehW9IEjVSdvqwxictDFDYIPLtPeAg8oSBJ9NIaAbDFhXi-5AzaXEJU1C7ix_Kzm9MzLmHle01xw=@pm.me> References: <3vAXPRorApPTLfmswEU7NqPtsUAi82W7ehW9IEjVSdvqwxictDFDYIPLtPeAg8oSBJ9NIaAbDFhXi-5AzaXEJU1C7ix_Kzm9MzLmHle01xw=@pm.me> Message-ID: Thank you! that's very helpful :) On Thu, 27 Jun 2019 at 12:27, Roman Yurchak via scikit-learn < scikit-learn at python.org> wrote: > Meanwhile, loading the CSV from OpenML (https://www.openml.org/d/40945) > would also work, > > pd.read_csv('https://www.openml.org/data/get_csv/16826755/phpMYEkMl') > > -- > Roman > > On 25/06/2019 17:04, Andreas Mueller wrote: > > By the time your book comes out, it's likely to be merged, but might not > > be released, depending on your timeline. > > It might be easier for your to upload the CSV file to a repository you > > control yourself. > > > > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From mcasl at unileon.es Fri Jun 28 09:36:22 2019 From: mcasl at unileon.es (=?UTF-8?Q?Manuel_CASTEJ=C3=93N_LIMAS?=) Date: Fri, 28 Jun 2019 15:36:22 +0200 Subject: [scikit-learn] baggingClassifier with pipeline In-Reply-To: References: Message-ID: You can always add a first step that turns you numpy array into a DataFrame such as the one required afterwards. A bit of object oriented programming might be required though, for deriving you class from BaseTransformer and writing you particular code for fit and transform method. Alternatively you can try the PipeGraph library for dealing with those complex routes. Best Manuel Disclaimer: yes, I'm a coauthour of the PipeGraph library. El vie., 28 jun. 2019 7:28, Roxana Danger escribi?: > Hello, > I would like to use the BaggingClassifier whose base estimator is a > pipeline with multiple transformations including a DataFrameMapper from > sklearn_pandas. > I am getting an error during the fitting the DataFrameMapper as the first > step of the BaggingClassifier is to convert the DataFrame to an array (see > in BaseBagging._fit method). Similar problem happen using directly > sklearn.Pipeline instead of the DataFrameMapper. in both cases, a DataFrame > is expected as input, but, instead, an array is provided to the Pipeline. > > Is there anyway I can overcome this problem? > > Many thanks, > Roxana > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From reismc at ime.eb.br Fri Jun 28 15:47:45 2019 From: reismc at ime.eb.br (Mauricio Reis) Date: Fri, 28 Jun 2019 16:47:45 -0300 Subject: [scikit-learn] Scikit Learn in a Cray computer In-Reply-To: References: <7fb316393ac7101e1a1809c7fdb4daa6@ime.eb.br> <676dd88da2829a7589629a735a2191d9@ime.eb.br> Message-ID: My laptop has Intel I7 processor with 4 cores. When I run the program on Windows 10, the "joblib.cpu_count()" routine returns "4". 
In these cases, the same test I did on the Cray computer caused a 10% increase in the processing time of the DBScan routine when I used the "n_jobs = 4" parameter compared to the processing time of that routine without this parameter. Do you know what is the cause of the longer processing time when I use "n_jobs = 4" on my laptop? --- Ats., Mauricio Reis Em 28/06/2019 06:29, Brown J.B. via scikit-learn escreveu: >> where you can see "ncpus = 1" (I still do not know why 4 lines were >> printed - >> >> (total of 40 nodes) and each node has 1 CPU and 1 GPU! > >> #PBS -l select=1:ncpus=8:mpiprocs=8 >> aprun -n 4 p.sh ./ncpus.py > > You can request 8 CPUs from a job scheduler, but if each node the > script runs on contains only one virtual/physical core, then > cpu_count() will return 1. > If that CPU supports multi-threading, you would typically get 2. > > For example, on my workstation: > `--> egrep "processor|model name|core id" /proc/cpuinfo > processor : 0 > model name : Intel(R) Core(TM) i3-4160 CPU @ 3.60GHz > core id : 0 > processor : 1 > model name : Intel(R) Core(TM) i3-4160 CPU @ 3.60GHz > core id : 1 > processor : 2 > model name : Intel(R) Core(TM) i3-4160 CPU @ 3.60GHz > core id : 0 > processor : 3 > model name : Intel(R) Core(TM) i3-4160 CPU @ 3.60GHz > core id : 1 > `--> python3 -c "from sklearn.externals import joblib; > print(joblib.cpu_count())" > 4 > > It seems that in this situation, if you're wanting to parallelize > *independent* sklearn calculations (e.g., changing dataset or random > seed), you'll ask for the MPI by PBS processes like you have, but > you'll need to place the sklearn computations in a function and then > take care of distributing that function call across the MPI processes. > > Then again, if the runs are independent, it's a lot easier to write a > for loop in a shell script that changes the dataset/seed and submits > it to the job scheduler to let the job handler take care of the > parallel distribution. > (I do this when performing 10+ independent runs of sklearn modeling, > where models use multiple threads during calculations; in my case, > SLURM then takes care of finding the available nodes to distribute the > work to.) > > Hope this helps. > J.B. > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn From reismc at ime.eb.br Fri Jun 28 16:04:07 2019 From: reismc at ime.eb.br (Mauricio Reis) Date: Fri, 28 Jun 2019 17:04:07 -0300 Subject: [scikit-learn] Scikit Learn in a Cray computer In-Reply-To: References: <7fb316393ac7101e1a1809c7fdb4daa6@ime.eb.br> <676dd88da2829a7589629a735a2191d9@ime.eb.br> Message-ID: <21a7d7a108850b9e756ab42d668b23b9@ime.eb.br> Sorry, but just now I reread your answer more closely. It seems that the "n_jobs" parameter of the DBScan routine brings no benefit to performance. If I want to improve the performance of the DBScan routine I will have to redesign the solution to use MPI resources. Is it correct? --- Ats., Mauricio Reis Em 28/06/2019 16:47, Mauricio Reis escreveu: > My laptop has Intel I7 processor with 4 cores. When I run the program > on Windows 10, the "joblib.cpu_count()" routine returns "4". In these > cases, the same test I did on the Cray computer caused a 10% increase > in the processing time of the DBScan routine when I used the "n_jobs = > 4" parameter compared to the processing time of that routine without > this parameter. 
Do you know what is the cause of the longer processing > time when I use "n_jobs = 4" on my laptop? > > --- > Ats., > Mauricio Reis > > Em 28/06/2019 06:29, Brown J.B. via scikit-learn escreveu: >>> where you can see "ncpus = 1" (I still do not know why 4 lines were >>> printed - >>> >>> (total of 40 nodes) and each node has 1 CPU and 1 GPU! >> >>> #PBS -l select=1:ncpus=8:mpiprocs=8 >>> aprun -n 4 p.sh ./ncpus.py >> >> You can request 8 CPUs from a job scheduler, but if each node the >> script runs on contains only one virtual/physical core, then >> cpu_count() will return 1. >> If that CPU supports multi-threading, you would typically get 2. >> >> For example, on my workstation: >> `--> egrep "processor|model name|core id" /proc/cpuinfo >> processor : 0 >> model name : Intel(R) Core(TM) i3-4160 CPU @ 3.60GHz >> core id : 0 >> processor : 1 >> model name : Intel(R) Core(TM) i3-4160 CPU @ 3.60GHz >> core id : 1 >> processor : 2 >> model name : Intel(R) Core(TM) i3-4160 CPU @ 3.60GHz >> core id : 0 >> processor : 3 >> model name : Intel(R) Core(TM) i3-4160 CPU @ 3.60GHz >> core id : 1 >> `--> python3 -c "from sklearn.externals import joblib; >> print(joblib.cpu_count())" >> 4 >> >> It seems that in this situation, if you're wanting to parallelize >> *independent* sklearn calculations (e.g., changing dataset or random >> seed), you'll ask for the MPI by PBS processes like you have, but >> you'll need to place the sklearn computations in a function and then >> take care of distributing that function call across the MPI processes. >> >> Then again, if the runs are independent, it's a lot easier to write a >> for loop in a shell script that changes the dataset/seed and submits >> it to the job scheduler to let the job handler take care of the >> parallel distribution. >> (I do this when performing 10+ independent runs of sklearn modeling, >> where models use multiple threads during calculations; in my case, >> SLURM then takes care of finding the available nodes to distribute the >> work to.) >> >> Hope this helps. >> J.B. >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn From roxana.danger at gmail.com Fri Jun 28 16:17:41 2019 From: roxana.danger at gmail.com (Roxana Danger) Date: Fri, 28 Jun 2019 21:17:41 +0100 Subject: [scikit-learn] baggingClassifier with pipeline In-Reply-To: References: Message-ID: Hi Manuel, thanks for your reply, before trying an alternative as PipeGraph, or implementing the class as you propose, I would prefer to include some code in the _fit method of BaggingClassifier, so the correct value of X can be passed to the base_estimator (the dataframe or its array of values). Many thanks in advance, Roxna On Fri, Jun 28, 2019 at 2:39 PM Manuel CASTEJ?N LIMAS via scikit-learn < scikit-learn at python.org> wrote: > You can always add a first step that turns you numpy array into a > DataFrame such as the one required afterwards. > A bit of object oriented programming might be required though, for > deriving you class from BaseTransformer and writing you particular code for > fit and transform method. > Alternatively you can try the PipeGraph library for dealing with those > complex routes. > Best > Manuel > Disclaimer: yes, I'm a coauthour of the PipeGraph library. > > El vie., 28 jun. 
2019 7:28, Roxana Danger > escribi?: > >> Hello, >> I would like to use the BaggingClassifier whose base estimator is a >> pipeline with multiple transformations including a DataFrameMapper from >> sklearn_pandas. >> I am getting an error during the fitting the DataFrameMapper as the first >> step of the BaggingClassifier is to convert the DataFrame to an array (see >> in BaseBagging._fit method). Similar problem happen using directly >> sklearn.Pipeline instead of the DataFrameMapper. in both cases, a DataFrame >> is expected as input, but, instead, an array is provided to the Pipeline. >> >> Is there anyway I can overcome this problem? >> >> Many thanks, >> Roxana >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From olivier.grisel at ensta.org Sat Jun 29 02:43:37 2019 From: olivier.grisel at ensta.org (Olivier Grisel) Date: Sat, 29 Jun 2019 08:43:37 +0200 Subject: [scikit-learn] Scikit Learn in a Cray computer In-Reply-To: <21a7d7a108850b9e756ab42d668b23b9@ime.eb.br> References: <7fb316393ac7101e1a1809c7fdb4daa6@ime.eb.br> <676dd88da2829a7589629a735a2191d9@ime.eb.br> <21a7d7a108850b9e756ab42d668b23b9@ime.eb.br> Message-ID: You have to use a dedicated framework to distribute the computation on a cluster like you cray system. You can use mpi, or dask with dask-jobqueue but the also need to run parallel algorithms that are efficient when running in a distributed with a high cost for communication between distributed worker nodes. I am not sure that the dbscan implementation in scikit-learn would benefit much from naively running in distributed mode. Le ven. 28 juin 2019 22 h 06, Mauricio Reis a ?crit : > Sorry, but just now I reread your answer more closely. > > It seems that the "n_jobs" parameter of the DBScan routine brings no > benefit to performance. If I want to improve the performance of the > DBScan routine I will have to redesign the solution to use MPI > resources. > > Is it correct? > > --- > Ats., > Mauricio Reis > > Em 28/06/2019 16:47, Mauricio Reis escreveu: > > My laptop has Intel I7 processor with 4 cores. When I run the program > > on Windows 10, the "joblib.cpu_count()" routine returns "4". In these > > cases, the same test I did on the Cray computer caused a 10% increase > > in the processing time of the DBScan routine when I used the "n_jobs = > > 4" parameter compared to the processing time of that routine without > > this parameter. Do you know what is the cause of the longer processing > > time when I use "n_jobs = 4" on my laptop? > > > > --- > > Ats., > > Mauricio Reis > > > > Em 28/06/2019 06:29, Brown J.B. via scikit-learn escreveu: > >>> where you can see "ncpus = 1" (I still do not know why 4 lines were > >>> printed - > >>> > >>> (total of 40 nodes) and each node has 1 CPU and 1 GPU! > >> > >>> #PBS -l select=1:ncpus=8:mpiprocs=8 > >>> aprun -n 4 p.sh ./ncpus.py > >> > >> You can request 8 CPUs from a job scheduler, but if each node the > >> script runs on contains only one virtual/physical core, then > >> cpu_count() will return 1. > >> If that CPU supports multi-threading, you would typically get 2. 
From char at upatras.gr Sat Jun 29 09:30:06 2019
From: char at upatras.gr (Christos Aridas)
Date: Sat, 29 Jun 2019 16:30:06 +0300
Subject: [scikit-learn] ANN: imbalanced-learn 0.5.0
Message-ID:

Hi all,

The new release of imbalanced-learn is already available via pip and conda!

The changelog can be found here:
http://imbalanced-learn.org/en/stable/whats_new.html#version-0-5

The documentation is available at:
http://imbalanced-learn.org/en/stable/

Cheers,
Chris
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From mlcnworkshop at gmail.com Sun Jun 30 04:14:41 2019
From: mlcnworkshop at gmail.com (MLCN Workshop)
Date: Sun, 30 Jun 2019 10:14:41 +0200
Subject: [scikit-learn] [DEADLINE EXTENSION]: The 2nd International Workshop
 on Machine Learning in Clinical Neuroimaging (MLCN 2019): ENTERING THE ERA
 OF BIG DATA VIA TRANSFER LEARNING AND DATA HARMONIZATION
Message-ID:

Dear Colleagues,

The submission deadline for the MLCN 2019 workshop, a satellite event of
the MICCAI 2019 conference, has been extended to 14/07/2019. For more
information, please visit https://mlcnws.com/call-for-papers/ .

Call for Papers

Recent advances in neuroimaging and machine learning provide an exceptional
opportunity for investigators and physicians to discover complex
relationships between brain, behaviors, and mental and neurological
disorders.
The MLCN 2019 workshop (https://mlcnws.com), as a satellite event of MICCAI
2019 (https://www.miccai2019.org), aims to bring together researchers in
both theory and application from various fields, such as machine learning,
neuroimaging, and predictive clinical neuroscience. Topics of interest
include, but are not limited to:

- Transfer learning in clinical neuroimaging
- Model stability in transfer learning
- Data prerequisites for successful transfer learning
- Domain adaptation in neuroimaging
- Data harmonization across sites
- Data pooling – practical issues
- Cross-domain learning in neuroimaging
- Interpretability for transfer learning
- Unsupervised methods for domain adaptation
- Multi-site data analysis, from preprocessing to modeling
- Big data in clinical neuroimaging
- Scalable machine learning methods
- Benefits, problems, and solutions of working with very large datasets

SUBMISSION PROCESS:

The workshop seeks high-quality, original, and unpublished work on
algorithms, theory, and applications of machine learning in clinical
neuroimaging related to big data, transfer learning, and data harmonization.
Papers should be submitted electronically in Springer Lecture Notes in
Computer Science (LNCS) style (
https://www.springer.com/gp/computer-science/lncs/conference-proceedings-guidelines)
with up to 8 pages, using the CMT system at
https://cmt3.research.microsoft.com/MLCN2019. The MLCN workshop uses a
double-blind review process in the evaluation phase, so authors must ensure
their submissions are anonymous. Accepted papers will be published in joint
proceedings with the MICCAI 2019 conference.

IMPORTANT DATES:

- Paper submission deadline: July 14, 2019 (23:59 PST)
- Notification of Acceptance: August 5, 2019
- Camera-ready Submission: August 12, 2019
- Workshop Date: October 13, 2019

Best regards,
MLCN 2019 Organizing Committee
Email: mlcnworkshop at gmail.com
Website: https://mlcnws.com/
Twitter: @MLCNworkshop
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From shukla.sharma at ensait.fr Sun Jun 30 14:13:02 2019
From: shukla.sharma at ensait.fr (Shukla Sharma)
Date: Sun, 30 Jun 2019 20:13:02 +0200 (CEST)
Subject: [scikit-learn] BIRCH for streaming data
In-Reply-To: <1168751467.11344122.1561735873785.JavaMail.zimbra@ensait.fr>
Message-ID: <2054611059.11454525.1561918382443.JavaMail.zimbra@ensait.fr>

Dear Scikit Team,

I am working to build a recommendation model on streaming data.
Algorithm used: BIRCH

# trained with 1000 records
birch_model.partial_fit(x)
label count is 1000 and cluster count is 4

joblib_file = filename
joblib.dump(birch_model, joblib_file)
joblib_model = joblib.load(joblib_file)

# trained with 100 new records
joblib_model.partial_fit(x)
label count is 100 and cluster count is 3

Is this a correct way to apply BIRCH to streaming data? I could not find
any documentation which explains whether, when we retrain the model with a
new set of records, it also carries over the previous information.

Thanks & Regards,
Shukla Sharma
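Regarding the Birch question above: as far as I understand the scikit-learn
implementation, partial_fit updates the existing CF tree rather than
rebuilding it, while labels_ only refers to the batch most recently passed
in (which is why the label count dropped to 100), and dumping and re-loading
the model with joblib in between does not lose the earlier state. A small
sketch on synthetic data (array shapes and parameters are arbitrary) that
makes this visible:

import numpy as np
from sklearn.cluster import Birch

rng = np.random.RandomState(0)
birch = Birch(n_clusters=4)

birch.partial_fit(rng.rand(1000, 5))       # first batch
print(birch.subcluster_centers_.shape[0])  # CF-tree leaves built so far
print(len(birch.labels_))                  # 1000: labels for this batch only

birch.partial_fit(rng.rand(100, 5))        # second batch: tree is updated, not rebuilt
print(birch.subcluster_centers_.shape[0])  # at least as many subclusters as before
print(len(birch.labels_))                  # 100: again, only the latest batch

# Labels for any records, old or new, can always be recomputed with predict().
print(birch.predict(rng.rand(50, 5)))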
From desitter.gravity at gmail.com Sun Jun 30 18:20:05 2019
From: desitter.gravity at gmail.com (desitter.gravity at gmail.com)
Date: Sun, 30 Jun 2019 15:20:05 -0700
Subject: [scikit-learn] Scikit Learn in a Cray computer
In-Reply-To:
References: <7fb316393ac7101e1a1809c7fdb4daa6@ime.eb.br>
 <676dd88da2829a7589629a735a2191d9@ime.eb.br>
 <21a7d7a108850b9e756ab42d668b23b9@ime.eb.br>
Message-ID:

Dear All,

Alex Lovell-Troy heads up innovation/cloud supercomputing at Cray (cc'd) and
he is a great resource for all things. I thought he might find this thread
useful.

Best,
Alex

On Fri, Jun 28, 2019 at 11:45 PM Olivier Grisel wrote:

> You have to use a dedicated framework to distribute the computation on a
> cluster like your Cray system.
>
> You can use MPI, or dask with dask-jobqueue, but you also need to run
> parallel algorithms that are efficient when running in a distributed
> setting with a high cost for communication between distributed worker
> nodes.
>
> I am not sure that the dbscan implementation in scikit-learn would benefit
> much from naively running in distributed mode.
>
> Le ven. 28 juin 2019 22 h 06, Mauricio Reis a écrit :
>
>> Sorry, but just now I reread your answer more closely.
>>
>> It seems that the "n_jobs" parameter of the DBScan routine brings no
>> benefit to performance. If I want to improve the performance of the
>> DBScan routine I will have to redesign the solution to use MPI
>> resources.
>>
>> Is it correct?
>>
>> ---
>> Ats.,
>> Mauricio Reis
>>
>> Em 28/06/2019 16:47, Mauricio Reis escreveu:
>> > My laptop has Intel I7 processor with 4 cores. When I run the program
>> > on Windows 10, the "joblib.cpu_count()" routine returns "4". In these
>> > cases, the same test I did on the Cray computer caused a 10% increase
>> > in the processing time of the DBScan routine when I used the "n_jobs =
>> > 4" parameter compared to the processing time of that routine without
>> > this parameter. Do you know what is the cause of the longer processing
>> > time when I use "n_jobs = 4" on my laptop?
>> >
>> > ---
>> > Ats.,
>> > Mauricio Reis
>> >
>> > Em 28/06/2019 06:29, Brown J.B. via scikit-learn escreveu:
>> >>> where you can see "ncpus = 1" (I still do not know why 4 lines were
>> >>> printed -
>> >>>
>> >>> (total of 40 nodes) and each node has 1 CPU and 1 GPU!
>> >>
>> >>> #PBS -l select=1:ncpus=8:mpiprocs=8
>> >>> aprun -n 4 p.sh ./ncpus.py
>> >>
>> >> You can request 8 CPUs from a job scheduler, but if each node the
>> >> script runs on contains only one virtual/physical core, then
>> >> cpu_count() will return 1.
>> >> If that CPU supports multi-threading, you would typically get 2.
>> >>
>> >> For example, on my workstation:
>> >> `--> egrep "processor|model name|core id" /proc/cpuinfo
>> >> processor : 0
>> >> model name : Intel(R) Core(TM) i3-4160 CPU @ 3.60GHz
>> >> core id : 0
>> >> processor : 1
>> >> model name : Intel(R) Core(TM) i3-4160 CPU @ 3.60GHz
>> >> core id : 1
>> >> processor : 2
>> >> model name : Intel(R) Core(TM) i3-4160 CPU @ 3.60GHz
>> >> core id : 0
>> >> processor : 3
>> >> model name : Intel(R) Core(TM) i3-4160 CPU @ 3.60GHz
>> >> core id : 1
>> >> `--> python3 -c "from sklearn.externals import joblib;
>> >> print(joblib.cpu_count())"
>> >> 4
>> >>
>> >> It seems that in this situation, if you're wanting to parallelize
>> >> *independent* sklearn calculations (e.g., changing dataset or random
>> >> seed), you'll ask for the MPI by PBS processes like you have, but
>> >> you'll need to place the sklearn computations in a function and then
>> >> take care of distributing that function call across the MPI processes.
>> >>
>> >> Then again, if the runs are independent, it's a lot easier to write a
>> >> for loop in a shell script that changes the dataset/seed and submits
>> >> it to the job scheduler to let the job handler take care of the
>> >> parallel distribution.
>> >> (I do this when performing 10+ independent runs of sklearn modeling,
>> >> where models use multiple threads during calculations; in my case,
>> >> SLURM then takes care of finding the available nodes to distribute the
>> >> work to.)
>> >>
>> >> Hope this helps.
>> >> J.B.
>> >> _______________________________________________
>> >> scikit-learn mailing list
>> >> scikit-learn at python.org
>> >> https://mail.python.org/mailman/listinfo/scikit-learn
>> _______________________________________________
>> scikit-learn mailing list
>> scikit-learn at python.org
>> https://mail.python.org/mailman/listinfo/scikit-learn
>>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>
--
Alex Morrise, PhD
Co-Founder & CTO, StayOpen.com
Chief Science Officer, MediaJel.com
Professional Bio: Machine Learning Intelligence
-------------- next part --------------
An HTML attachment was scrubbed...
URL: