From joel.nothman at gmail.com Sat Mar 2 16:17:39 2019
From: joel.nothman at gmail.com (Joel Nothman)
Date: Sat, 2 Mar 2019 22:17:39 +0100
Subject: [scikit-learn] ANN: Scikit-learn 0.20.3 released
Message-ID:

A bug fix release of Scikit-learn, version 0.20.3, has been released. It is
not yet on the Conda default channel, but should be available on PyPI and
conda-forge.

Thank you to all who contributed. Substantive changes are listed at
https://scikit-learn.org/0.20/whats_new.html#version-0-20-3

And after a very successful sprint in Paris, the development of version 0.21
is well under way (https://scikit-learn.org/dev/whats_new.html#version-0-21-0)
and we will start working towards its release.

Reminder: only Python >= 3.5 will be supported in version 0.21.

The scikit-learn developer team


From rajnishk7.info at gmail.com Sat Mar 9 13:34:57 2019
From: rajnishk7.info at gmail.com (Rajnish kamboj)
Date: Sun, 10 Mar 2019 00:04:57 +0530
Subject: [scikit-learn] Difference in prediction accuracy using SGDClassifier and Cross validation scores.
Message-ID:

Hi

I have recently started machine learning and this is my first query regarding
prediction accuracy.

There is a difference in prediction accuracy between using SGDClassifier
directly and the cross-validation scores.

import numpy as np
from sklearn.datasets import fetch_openml
from sklearn.linear_model import SGDClassifier

mnist = fetch_openml('mnist_784', version=1, cache=True)
X, y = mnist['data'], mnist['target']
X_train, X_test, y_train, y_test = X[:60000], X[60000:], y[:60000], y[60000:]
shuffled_index = np.random.permutation(60000)  # shuffle the 0 - 60000 range
X_train, y_train = X_train[shuffled_index], y_train[shuffled_index]

y_train_5 = (y_train == '5')
y_test_5 = (y_test == '5')

sgd_clf = SGDClassifier(random_state=42, tol=1e-3, max_iter=1000)
sgd_clf.fit(X_train, y_train_5)

# Predicting for all 5s
print("####### PREDICTION STATS ##############")
y_train_5_pred = sgd_clf.predict(X_train)

print("Total y_train_5 [False|True both]:", len(y_train_5))
print("Total y_train_5 [Only 5s]:", sum(y_train_5))

# some other digit may be predicted as 5 and some 5s may be predicted as not 5
print("Predicted 5s:", sum(y_train_5_pred))

correctly_predicted = sum(np.logical_and(y_train_5_pred, y_train_5))
print("Correct Predicted", correctly_predicted)
print("Accuracy:", correctly_predicted/sum(y_train_5) * 100)

from sklearn.model_selection import cross_val_score
cross_val_score(sgd_clf, X_train, y_train_5, cv=3, scoring='accuracy')

MY Output

####### PREDICTION STATS ##############
Total y_train_5 [False|True both]: 60000
Total y_train_5 [Only 5s]: 5421
Predicted 5s: 3863
Correct Predicted 3574
Accuracy: 65.9287954251983
array([0.9323 , 0.96805, 0.9641 ])
#######################################

So as per my observation there is a difference. Why?

SGDClassifier is ~65.92% accurate
cross_val_score is ~95%

Am I comparing them in the wrong way, or am I missing something?


Thanks

Rajnish


From rs2715 at stern.nyu.edu Tue Mar 12 14:45:17 2019
From: rs2715 at stern.nyu.edu (Reshama Shaikh)
Date: Tue, 12 Mar 2019 14:45:17 -0400
Subject: [scikit-learn] [WiMLDS scikit-learn] open source sprint in Nairobi, Kenya
Message-ID:

I am an organizer of the New York City chapter of WiMLDS (Women in Machine
Learning & Data Science) (http://wimlds.org).

We would like to organize a scikit-learn sprint for our Nairobi chapter
(https://www.meetup.com/topics/wimlds/all/), which is also our 4th largest of
51 chapters, with 2100+ members.

Reference: Impact Report for WiMLDS Scikit-learn Sprints
(https://reshamas.github.io/impact-report-for-wimlds-scikit-learn-sprints/)

Would anyone be available to facilitate this event? It would be on a Saturday
in June 2019.

I can be reached at reshama at wimlds.org for more information.

Best,
Reshama

----------------------------------------------
Reshama Shaikh
NYC WiMLDS
NYC PyLadies
----------------------------------------------


From joel.nothman at gmail.com Tue Mar 12 19:19:21 2019
From: joel.nothman at gmail.com (Joel Nothman)
Date: Wed, 13 Mar 2019 10:19:21 +1100
Subject: [scikit-learn] Difference in prediction accuracy using SGDClassifier and Cross validation scores.
In-Reply-To:
References:
Message-ID:

You are calculating recall, not accuracy.

correctly_predicted / sum(y_train_5) is the fraction of actual 5s that were
found (recall), while accuracy is the fraction of all 60000 samples predicted
correctly, including the non-5s correctly predicted as non-5s, which is what
cross_val_score with scoring='accuracy' reports.

On Sun, 10 Mar 2019 at 05:36, Rajnish kamboj wrote:
>
> Hi
>
> I have recently started machine learning and it is my first query
> regarding prediction accuracy.
>
> There is difference in prediction accuracy using SGDClassifier and Cross
> validation scores.
>
> import numpy as np
> from sklearn.datasets import fetch_openml
> from sklearn.linear_model import SGDClassifier
>
> mnist = fetch_openml('mnist_784', version=1, cache=True)
> X, y = mnist['data'], mnist['target']
> X_train, X_test, y_train, y_test = X[:60000], X[60000:], y[:60000], y[60000:]
> shuffled_index = np.random.permutation(60000) # shuffle the 0 - 60000 range
> X_train, y_train = X_train[shuffled_index], y_train[shuffled_index]
>
> y_train_5 = (y_train == '5')
> y_test_5 = (y_test == '5')
>
> sgd_clf = SGDClassifier(random_state=42, tol=1e-3, max_iter=1000)
> sgd_clf.fit(X_train, y_train_5)
>
> # Predicting for all 5s
> print("####### PREDICTION STATS ##############")
> y_train_5_pred = sgd_clf.predict(X_train)
>
> print("Total y_train_5 [False|True both]]:", len(y_train_5))
> print("Total y_train_5 [Only 5s]:", sum(y_train_5))
>
> # some other digit may be predicted as 5 and some 5s may be predicted as not 5
> print("Predicted 5s:", sum(y_train_5_pred))
>
> correctly_predicted = sum(np.logical_and(y_train_5_pred, y_train_5))
> print("Correct Predicted", correctly_predicted)
> print("Accuracy:", correctly_predicted/sum(y_train_5) * 100)
>
> from sklearn.model_selection import cross_val_score
> cross_val_score(sgd_clf, X_train, y_train_5, cv=3, scoring='accuracy')
>
> MY Output
>
> ####### PREDICTION STATS ##############
> Total y_train_5 [False|True both]]: 60000
> Total y_train_5 [Only 5s]: 5421
> Predicted 5s: 3863
> Correct Predicted 3574
> Accuracy: 65.9287954251983
> array([0.9323 , 0.96805, 0.9641 ])
> #######################################
>
> So as per my observation there is a difference, why?
>
> SGDCLassifier is ~65.92% accurate
> cross_val_score are ~95%
>
> Am I comparing it in wrong way? OR I am missing something?
>
>
> Thanks
>
> Rajnish
>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn


From pahome.chen at mirlab.org Wed Mar 13 23:45:28 2019
From: pahome.chen at mirlab.org (lampahome)
Date: Thu, 14 Mar 2019 11:45:28 +0800
Subject: [scikit-learn] What theory cause SGDRegressor can partial_fit but RandomForestRegressor can't?
Message-ID:

As the title says, I'm confused about why some algorithms can partial_fit and
some can't.

For regression models, I found that SGD can but RF can't.

Is it because of a difference in the algorithms? I assumed partial_fit is
possible because of gradient descent, or is there another reason?

thx


From mail at sebastianraschka.com Thu Mar 14 01:28:07 2019
From: mail at sebastianraschka.com (Sebastian Raschka)
Date: Thu, 14 Mar 2019 00:28:07 -0500
Subject: [scikit-learn] What theory cause SGDRegressor can partial_fit but RandomForestRegressor can't?
In-Reply-To:
References:
Message-ID:

It's not necessarily unique to stochastic gradient descent, it's more that
some other algorithms are generally not well suited for "partial_fit". For
SGD, partial fit is a more natural thing to do since you estimate the
training loss from minibatches anyway -- i.e., you do SGD step by step
anyway.

Also, think about it this way: models trained via SGD are typically
parametric, so the number of parameters is fixed, and you simply adjust
their values iteratively during training. For nonparametric models, such as
RF, the number of parameters (e.g., if you think about each node in the
decision tree as a parameter) depends on the examples present in the
training set. I.e., how deep each individual decision tree eventually
becomes depends on the training set. So, it doesn't make sense to build a
decision tree on a few training examples and then update it later by feeding
it more training examples. Either way, you would probably end up throwing
away the decision tree and building a new one if you get additional data.

I am sure solutions for "updating" decision trees exist, which produce
somewhat reasonable results efficiently, but it's less natural and not a
common thing to do, which is why it's probably not implemented in
scikit-learn.

Best,
Sebastian

> On Mar 13, 2019, at 10:45 PM, lampahome wrote:
>
> As title, I'm confused why some algo can partial_fit and some algo can't.
>
> For regression model, I found SGD can but RF can't.
>
> Is about the difference of algo? I thought it's able to partial_fit
> because gradient descent, or just another reason?
>
> thx
> _________________________


From pahome.chen at mirlab.org Sun Mar 17 22:10:03 2019
From: pahome.chen at mirlab.org (lampahome)
Date: Mon, 18 Mar 2019 10:10:03 +0800
Subject: [scikit-learn] Any model can predict multiple trend from hierarchical data?
Message-ID:

My hierarchical data are the sales numbers of 3 hot drinks and 3 cold drinks
each month.

Generally, clustering them into two groups, one containing the hot drinks and
the other the cold drinks, works better. But I don't want to cluster.

When I studied sklearn.linear_model, I found it can only predict one trend
for both the hot and the cold pattern; the trend of hot and cold comes out
the same. That makes sense because it's a "linear" model, which is suitable
for linearly separable data.

Now, I want to predict a different trend for hot and cold drinks with only
one model. If I can use as many features as I want, is there any model able
to predict multiple patterns from hierarchical data?

PS: assume there is no noise; the data only contain the trend of each kind
of drink.

thx
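
One common way to get a different trend per drink type out of a single linear
model is to encode the group as an extra feature and include its interaction
with time, so that each group gets its own slope. A minimal sketch, with
made-up monthly sales and variable names chosen purely for illustration (none
of this comes from the original post):

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(0)

# Illustrative data: 12 months of sales for one hot and one cold drink.
months = np.arange(1, 13)
hot_sales = 100 - 5 * months + rng.randn(12)   # downward trend
cold_sales = 20 + 7 * months + rng.randn(12)   # upward trend

# Features: month, a cold-drink indicator, and their interaction.
# The interaction column lets one LinearRegression fit a separate slope
# for each group instead of a single shared trend.
month_col = np.concatenate([months, months])
is_cold = np.concatenate([np.zeros(12), np.ones(12)])
X = np.column_stack([month_col, is_cold, month_col * is_cold])
y = np.concatenate([hot_sales, cold_sales])

model = LinearRegression().fit(X, y)
print(model.coef_)   # hot slope, cold offset, extra slope for the cold group

The same idea extends to all six drinks by one-hot encoding the drink ID
(e.g. with OneHotEncoder) and multiplying each indicator column by the time
feature.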

From abdillah at buaa.edu.cn Tue Mar 19 05:44:02 2019
From: abdillah at buaa.edu.cn (=?UTF-8?B?6Zi/5biD6L+q5YWw?=)
Date: Tue, 19 Mar 2019 17:44:02 +0800 (GMT+08:00)
Subject: [scikit-learn] How to write the cross-project prediction algorithm
Message-ID:

I am new to machine learning and I am facing an issue.

I have 7 projects, and I would like to predict whether a pull request will be
rejected or not (Yes or No). I would like to build a prediction model using
data from 6 projects as the source projects and predict the rejection of pull
requests in the seventh project as the target project.

Can someone tell me how I can structure my algorithm in scikit-learn? Hope
that my question is clear.

Thanks

Best regards,
Abdillah


From pahome.chen at mirlab.org Sun Mar 24 22:48:07 2019
From: pahome.chen at mirlab.org (lampahome)
Date: Mon, 25 Mar 2019 10:48:07 +0800
Subject: [scikit-learn] How to improve mse when training regression model with month-base data?
Message-ID:

I want to predict the number of items sold on each day of the month. But the
data are too big, so I train with incremental learning, e.g.
sklearn.neural_network.MLPRegressor.

I train on three months of data at a time, e.g. the 1st training uses data
from Jan. to Mar., then I train on Apr. to Jun., and so on.

Then I evaluate on Feb. data, and I find the MSE keeps growing as I train
incrementally up to Oct.-Dec. It seems catastrophic forgetting happens: the
prediction for Feb. becomes a mess, but the prediction for a recent month,
e.g. Dec., stays good because Dec. was trained most recently.

Tuning parameters doesn't improve things much because catastrophic forgetting
still happens frequently.

How should I improve this? Should I change the approach or the training
period? Or should I split the model into 12 models, with one model
responsible for each month?


From maxhalford25 at gmail.com Thu Mar 28 18:51:48 2019
From: maxhalford25 at gmail.com (Max Halford)
Date: Thu, 28 Mar 2019 23:51:48 +0100
Subject: [scikit-learn] F1 score weirdness
Message-ID:

Hey everyone,

I've stumbled upon an inconsistency with the F1 score and I can't seem to get
around it. I have two lists y_true = [0, 1, 2, 2, 2] and y_pred = [0, 0, 2,
2, 1]. sklearn tells me that the macro-averaged F1 score is 0.488888... If I
understand correctly the macro-average F1 score is the harmonic mean of the
macro-average precision score and the macro-average recall score. sklearn
tells me that the macro-average precision is 0.5 whilst the macro-average
recall is 0.555555... If I use the statistics.harmonic_mean function from
Python's standard library this gives me around 0.526315.

So which is correct: 0.488888 or 0.526315? I apologize in advance if I've
overlooked something silly.

Best regards.

--
Max Halford
+336 28 25 13 38


From joel.nothman at gmail.com Thu Mar 28 20:40:45 2019
From: joel.nothman at gmail.com (Joel Nothman)
Date: Fri, 29 Mar 2019 11:40:45 +1100
Subject: [scikit-learn] F1 score weirdness
In-Reply-To:
References:
Message-ID:

No, it is the macro average of the per-class F1, i.e. an arithmetic mean over
harmonic means of P & R per class.

On Fri., 29 Mar. 2019, 9:53 am Max Halford, wrote:
> Hey everyone,
>
> I've stumbled upon an inconsistency with the F1 score and I can't seem to
> get around it. I have two lists y_true = [0, 1, 2, 2, 2] and y_pred = [0,
> 0, 2, 2, 1].
> sklearn tells me that the macro-averaged F1 score is
> 0.488888... If I understand correctly the macro-average F1 score is the
> harmonic mean of the macro-average precision score and the macro-average
> recall score. sklearn tells me that the macro-average precision is 0.5
> whilst the macro-average recall is 0.555555... If use the
> statistics.harmonic_mean function from Python's standard library this
> gives me around 0.526315.
>
> So which is correct: 0.488888 or 0.526315? I apologize in advance if I've
> overlooked something silly.
>
> Best regards.
>
> --
> Max Halford
> +336 28 25 13 38
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn


From pahome.chen at mirlab.org Fri Mar 29 03:38:05 2019
From: pahome.chen at mirlab.org (lampahome)
Date: Fri, 29 Mar 2019 15:38:05 +0800
Subject: [scikit-learn] Can cluster based on the continuous access duration of an item?
Message-ID:

I have data which contain the access duration of each item.

EX: t0~t4 are the access time slots. 1 means the item was accessed in that
time slot, 0 means not.

ID,t0,t1,t2,t3,t4
0,1,0,0,1
1,1,0,0,1
2,0,0,1,1
3,0,1,1,1

Can I cluster the items that are accessed over a continuous duration?

Like above, ID=2 and ID=3 are what I want.

I tried KMeans and DBSCAN but they don't seem to work well.

Is there any algorithm you would recommend?

thx


From andt88 at hotmail.com Sun Mar 31 06:15:36 2019
From: andt88 at hotmail.com (Andreas Tosstorff)
Date: Sun, 31 Mar 2019 10:15:36 +0000
Subject: [scikit-learn] GridsearchCV returns worse scoring the broader parameter space gets
Message-ID:

Dear all,

I am new to scikit-learn so please excuse my ignorance. Using GridSearchCV I
am trying to optimize a DecisionTreeRegressor. The broader I make the
parameter space, the worse the scoring gets.

Setting min_samples_split to range(2,10) gives me a neg_mean_squared_error
of -0.04. When setting it to range(2,5) the score is -0.004.

simple_tree = GridSearchCV(tree.DecisionTreeRegressor(random_state=42),
                           n_jobs=4,
                           param_grid={'min_samples_split': range(2, 10)},
                           scoring='neg_mean_squared_error', cv=10,
                           refit='neg_mean_squared_error')

simple_tree.fit(x_tr,y_tr).score(x_tr,y_tr)

I expect an equal or more positive score for a more extensive grid search
compared to the less extensive one.

I would really appreciate your help!

Kind regards,
Andreas


From mail at sebastianraschka.com Sun Mar 31 14:57:16 2019
From: mail at sebastianraschka.com (Sebastian Raschka)
Date: Sun, 31 Mar 2019 13:57:16 -0500
Subject: [scikit-learn] GridsearchCV returns worse scoring the broader parameter space gets
In-Reply-To:
References:
Message-ID: <52CC8278-E40C-427B-9146-CEBD18E7C47A@sebastianraschka.com>

Hi Andreas,

the best score is determined by computing the test fold performance (I think
R^2 by default) and then averaging over them. Since you chose cv=10, you have
10 test folds, and the performance is the average performance over those for
choosing the best hyperparameter setting.

Then, it looks like you are computing the performance manually:

> simple_tree.fit(x_tr,y_tr).score(x_tr,y_tr)

on the whole training set. Instead, I would take a look at the
simple_tree.best_score_ attribute after fitting.
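
Roughly, something along these lines; this is only a minimal sketch with
synthetic stand-in data (the x_tr and y_tr below are generated, not the data
from the original post), and `search` is just a local name:

from sklearn import tree
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the poster's training data.
x_tr, y_tr = make_regression(n_samples=200, n_features=5, noise=10.0,
                             random_state=0)

search = GridSearchCV(tree.DecisionTreeRegressor(random_state=42),
                      param_grid={'min_samples_split': range(2, 10)},
                      scoring='neg_mean_squared_error', cv=10)
search.fit(x_tr, y_tr)

# Mean cross-validated score (negative MSE) of the best parameter setting,
# averaged over the 10 held-out test folds -- this is the number to compare
# between different parameter grids:
print(search.best_params_, search.best_score_)

# By contrast, scoring the refit best estimator on the training data itself
# mostly measures how far the tree can overfit, not how well it generalizes:
print(search.score(x_tr, y_tr))
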
If you do

Best,
Sebastian

> On Mar 31, 2019, at 5:15 AM, Andreas Tosstorff wrote:
>
> Dear all,
> I am new to scikit learn so please excuse my ignorance. Using GridsearchCV
> I am trying to optimize a DecisionTreeRegressor. The broader I make the
> parameter space, the worse the scoring gets.
> Setting min_samples_split to range(2,10) gives me a neg_mean_squared_error
> of -0.04. When setting it to range(2,5) The score is -0.004.
> simple_tree =GridSearchCV(tree.DecisionTreeRegressor(random_state=42),
> n_jobs=4, param_grid={'min_samples_split': range(2, 10)},
> scoring='neg_mean_squared_error', cv=10, refit='neg_mean_squared_error')
>
> simple_tree.fit(x_tr,y_tr).score(x_tr,y_tr)
>
> I expect an equal or more positive score for a more extensive grid search
> compared to the less extensive one.
>
> I would really appreciate your help!
>
> Kind regards,
> Andreas
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn


From joel.nothman at gmail.com Sun Mar 31 16:56:30 2019
From: joel.nothman at gmail.com (Joel Nothman)
Date: Mon, 1 Apr 2019 07:56:30 +1100
Subject: [scikit-learn] Can cluster based on the continuous access duration of an item?
In-Reply-To:
References:
Message-ID:

When clustering, it's often a good idea to think not about the algorithm used
to identify clusters, but about what distance metric might capture your
intuitions about similar and dissimilar points.

HTH

On Fri., 29 Mar. 2019, 6:39 pm lampahome, wrote:
> I have data which contain access duration of each items.
>
> EX: t0~t4 is the access time duration. 1 means the item was accessed in
> the time duration, 0 means not.
> ID,t0,t1,t2,t3,t4
> 0,1,0,0,1
> 1,1,0,0,1
> 2,0,0,1,1
> 3,0,1,1,1
>
> Can cluster the group which item will access for a continuous duration?
>
> Like above, ID=2,ID=3 are what I want.
>
> I try KMeans, DBSCAN but it seems doesn't well
>
> Is there any algo recommended?
>
> thx
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
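
As a concrete illustration of that suggestion, here is a minimal sketch using
the four rows from the question (with the access values as listed) and a
Jaccard distance, which only counts the time slots where at least one of the
two items was accessed; the choice of AgglomerativeClustering and
n_clusters=2 is just for illustration:

import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.cluster import AgglomerativeClustering

# Access vectors as given in the question (one row per item ID).
X = np.array([[1, 0, 0, 1],
              [1, 0, 0, 1],
              [0, 0, 1, 1],
              [0, 1, 1, 1]], dtype=bool)

# Pairwise Jaccard distances: 0 for identical access patterns, 1 for
# patterns with no overlapping time slots.
D = squareform(pdist(X, metric='jaccard'))

# affinity='precomputed' is the parameter name in the scikit-learn releases
# current at the time of this thread.
labels = AgglomerativeClustering(n_clusters=2, affinity='precomputed',
                                 linkage='average').fit_predict(D)
print(labels)   # items 0 and 1 end up together, as do items 2 and 3

DBSCAN accepts the same precomputed matrix via metric='precomputed', so the
earlier attempt may also work better once the metric matches the intuition.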