From moyi.dang at gmail.com Sat Oct 1 09:34:10 2016
From: moyi.dang at gmail.com (Moyi Dang)
Date: Sat, 1 Oct 2016 09:34:10 -0400
Subject: [scikit-learn] Why does sci-kit learn's hashingvectorizer give negative values?
Message-ID:

Hi,

I'm trying to make the HashingVectorizer work for online learning. To do this, I need it to give actual token counts. By default, the HashingVectorizer in scikit-learn does not return token counts; it returns counts normalized with either the l1 or the l2 norm. I need the raw token counts, so I set norm=None. After doing this I no longer get decimals, but I still get negative numbers.

It seems the negatives can be removed by setting non_negative=True, which takes the absolute value of each entry. However, I don't understand why the negatives are there in the first place, or what they mean. I'm not sure whether the absolute values correspond to the token counts.

Can someone please help explain what the HashingVectorizer is doing? How do I get the HashingVectorizer to return token counts?

You can replicate my results with the following code. I'm using the 20newsgroups dataset, which comes with scikit-learn:

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import HashingVectorizer

twenty_train = fetch_20newsgroups(subset='train', shuffle=True, random_state=42)

# produces normalized (decimal) results: by default each row is scaled to unit l2 norm
cv = HashingVectorizer(stop_words='english')
X_train = cv.fit_transform(twenty_train.data)
print(X_train)

# produces integer results, both positive and negative
cv = HashingVectorizer(stop_words='english', norm=None)
X_train = cv.fit_transform(twenty_train.data)
print(X_train)

# produces only non-negative results, but not sure if they correspond to counts
cv = HashingVectorizer(stop_words='english', norm=None, non_negative=True)
X_train = cv.fit_transform(twenty_train.data)
print(X_train)

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From rth.yurchak at gmail.com Sat Oct 1 10:17:40 2016
From: rth.yurchak at gmail.com (Roman Yurchak)
Date: Sat, 1 Oct 2016 16:17:40 +0200
Subject: [scikit-learn] Why does sci-kit learn's hashingvectorizer give negative values?
In-Reply-To:
References:
Message-ID: <57EFC584.7020608@gmail.com>

On 01/10/16 15:34, Moyi Dang wrote:
> However, I don't understand why the negatives are there in the first
> place, or what they mean. I'm not sure if the absolute values are
> corresponding to the token counts.
>
> Can someone please help explain what the HashingVectorizer is doing? How
> do I get the HashingVectorizer to return token counts?

Hi Moyi,

it's a mechanism to compensate for hash collisions, see
https://github.com/scikit-learn/scikit-learn/issues/7513

The absolute values are token counts for most practical applications (if you don't have too many collisions). There will be a PR shortly to make this more consistent.

From tevang3 at gmail.com Sat Oct 1 10:59:11 2016
From: tevang3 at gmail.com (Thomas Evangelidis)
Date: Sat, 1 Oct 2016 16:59:11 +0200
Subject: [scikit-learn] suggested machine learning algorithm
Message-ID:

Dear scikit-learn users and developers,

I have a dataset consisting of 42 observations (molnames) and 4 variables (VDWAALS, EEL, EGB, ESURF), with which I want to build a predictive model that estimates the experimental value (Expr). I tried multivariate linear regression using 10,000 bootstrap repeats, each time using 21 observations for training and the remaining 21 for testing, but the average correlation was only R = 0.1727 +- 0.19779.

molname        VDWAALS    EEL        EGB        ESURF     Expr
CHEMBL108457   -20.4848   -96.5826    23.4584   -5.4045   -7.27193
CHEMBL388269   -50.3860    28.9403   -51.5147   -6.4061   -6.8022
CHEMBL244078   -49.1466   -21.9869    17.7999   -6.4588   -6.61742
CHEMBL244077   -53.4365   -32.8943    34.8723   -7.0384   -6.61742
CHEMBL396772   -51.4111   -34.4904    36.0326   -6.5443   -5.82207
........

I would like your advice about what other machine learning algorithms I could try with these data. E.g. can I build a decision tree, or are the observations and variables too few to avoid overfitting? I could include more variables, but the number of observations will always remain 42.

I would greatly appreciate any advice!

Thomas

From ericmajinglong at gmail.com Sat Oct 1 14:37:35 2016
From: ericmajinglong at gmail.com (Eric Ma)
Date: Sat, 1 Oct 2016 14:37:35 -0400
Subject: [scikit-learn] suggested machine learning algorithm
In-Reply-To:
References:
Message-ID:

Hi Thomas,

A number of people I've learned from have given me the following "recipe", which I hold to loosely.

1. Start with Random Forest - it should be able to give you good baseline predictive capacity.
2. Let's say you don't care about interpretability, but only care about predictive value. Keep tweaking RF parameters (use grid search + cross validation), or switch to gradient boosting.
3. Let's say you do care about interpretability. Use RF's feature_importances_ to get out the features that are important for prediction. Try linear regression on just those; you may also want to try multiplying those features together to get the "interaction" product of those features. (This is using RF as a feature selection method.)

Beyond this, I am sure more "expert" types will be able to chime in, and also correct me if I've said anything wrong here.

Cheers
Eric

On Sat, Oct 1, 2016 at 10:59 AM, Thomas Evangelidis wrote:

> Dear scikit-learn users and developers,
>
> I have a dataset consisting of 42 observation (molnames) and 4 variables (
> VDWAALS, EEL, EGB, ESURF) with which I want to make a predictive model
> that estimates the experimental value (Expr). I tried multivariate linear
> regression using 10,000 bootstrap repeats each time using 21 observations
> for training and the rest 21 for testing, but the average correlation was
> only R= 0.1727 +- 0.19779.
>
> molname        VDWAALS    EEL        EGB        ESURF     Expr
> CHEMBL108457   -20.4848   -96.5826    23.4584   -5.4045   -7.27193
> CHEMBL388269   -50.3860    28.9403   -51.5147   -6.4061   -6.8022
> CHEMBL244078   -49.1466   -21.9869    17.7999   -6.4588   -6.61742
> CHEMBL244077   -53.4365   -32.8943    34.8723   -7.0384   -6.61742
> CHEMBL396772   -51.4111   -34.4904    36.0326   -6.5443   -5.82207
> ........
>
> I would like your advice about what other machine learning algorithm I
> could try with these data. E.g. can I make a decision tree or the
> observations and variable are too few to avoid overfitting? I could
> include more variables but the observations will always remain 42.
>
> I would greatly appreciate any advice!
>
> Thomas
>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn

From aadral at gmail.com Sat Oct 1 14:48:44 2016
From: aadral at gmail.com (Alexey Dral)
Date: Sat, 1 Oct 2016 19:48:44 +0100
Subject: [scikit-learn] suggested machine learning algorithm
In-Reply-To:
References:
Message-ID:

Hi Thomas,

What quality do you have on training?

There is no silver bullet, but there is a quite common technique you can use to find out whether you are using an appropriate algorithm. You can look at the gap between the "train" and "validation" quality in learning curves (example). If you see a big gap, you can reduce the complexity of your model to overcome overfitting (reduce interaction parameter / number of variables / iterations / ...). If you see a small gap, you can try to increase model complexity to fit your data better.

Moreover, I see you have a tiny dataset and use a 50/50 split. I presume that you will train the "production" model on the whole available dataset. In that case, I suggest you use more data for training and use an almost-LOO approach to better estimate your predictive quality. But be really cautious about cross-validation, as you can easily overfit your data.

--
Yours sincerely,
https://www.linkedin.com/in/alexey-dral
Alexey A. Dral
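The leave-one-out (LOO) idea suggested above can be sketched in plain Python. This is only a minimal illustration, not Thomas's actual pipeline: the six (x, y) pairs are made-up toy data, `fit_line` and `loo_predictions` are hypothetical helper names, and a one-variable least-squares line stands in for whatever model is really used. With n observations, each model is fitted on n - 1 of them (41 of the 42 in the dataset discussed here) and evaluated on the single held-out one. In scikit-learn the corresponding splitter is `sklearn.model_selection.LeaveOneOut`.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a * x + b with a single feature."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b


def loo_predictions(xs, ys):
    """Predict every observation from a model fitted on all the others."""
    preds = []
    for i in range(len(xs)):
        # Hold out observation i, train on the remaining n - 1.
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        a, b = fit_line(train_x, train_y)
        preds.append(a * xs[i] + b)
    return preds


# Made-up toy data (NOT the values from this thread).
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1.1, 1.9, 3.2, 3.8, 5.1, 6.0]

preds = loo_predictions(xs, ys)
mae = sum(abs(p - y) for p, y in zip(preds, ys)) / len(ys)
print("held-out predictions:", [round(p, 3) for p in preds])
print("LOO mean absolute error:", round(mae, 3))
```

Because every observation is predicted by a model that never saw it, the resulting error estimate uses almost all of the data for training, which is the point of preferring it over a 50/50 split on such a small dataset.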
From se.raschka at gmail.com Sat Oct 1 15:58:39 2016
From: se.raschka at gmail.com (Sebastian Raschka)
Date: Sat, 1 Oct 2016 15:58:39 -0400
Subject: [scikit-learn] suggested machine learning algorithm
In-Reply-To:
References:
Message-ID:

Maybe it's worth switching to LOOCV, since you may have a bit of a pessimistic bias here due to the small training set size (in a bootstrap sample, asymptotically only a fraction of about 0.632 of the samples are unique). I would try both linear and nonlinear models; instead of adding more features, maybe also try to eliminate some features via L1, feature selection, or feature extraction, in addition to trying different algorithms like random forests, Gaussian processes, RBF kernel SVM regression, and so forth.

From joel.nothman at gmail.com Sat Oct 1 18:11:42 2016
From: joel.nothman at gmail.com (Joel Nothman)
Date: Sun, 2 Oct 2016 09:11:42 +1100
Subject: [scikit-learn] Why does sci-kit learn's hashingvectorizer give negative values?
In-Reply-To: <57EFC584.7020608@gmail.com>
References: <57EFC584.7020608@gmail.com>
Message-ID:

Negative values are not really there to compensate for hash collisions. They are there because they make the hashed vector space an approximation to the full vector space under the inner product.
From jblackburne at gmail.com Sun Oct 2 02:19:53 2016
From: jblackburne at gmail.com (Jeff Blackburne)
Date: Sat, 1 Oct 2016 23:19:53 -0700
Subject: [scikit-learn] Strange behavior when I add a member to a cython struct
Message-ID:

Hi,

As part of my work on PR #4899 (categorical splits for tree-based learners), I want to add a pointer member to the Node struct in sklearn/tree/_tree.pxd. But when I do this, it causes some of the unit tests to fail in the 32-bit Appveyor (Windows) CI. (Actually, it usually causes them to hang indefinitely.)

I'm testing this with the latest commit on master. The patch I'm applying is listed in full below; it's tiny. If you like, I can make a new PR to demonstrate the behavior.

Does anyone know why this would happen, and only on 32-bit Windows?

Thanks,
Jeff

diff --git a/sklearn/tree/_tree.pxd b/sklearn/tree/_tree.pxd
index dbf0545..b80e7bb 100644
--- a/sklearn/tree/_tree.pxd
+++ b/sklearn/tree/_tree.pxd
@@ -32,6 +32,7 @@ cdef struct Node:
     DOUBLE_t impurity                    # Impurity of the node (i.e., the value of the criterion)
     SIZE_t n_node_samples                # Number of samples at the node
     DOUBLE_t weighted_n_node_samples     # Weighted number of samples at the node
+    UINT32_t *foo


 cdef class Tree:
diff --git a/sklearn/tree/_tree.pyx b/sklearn/tree/_tree.pyx
index 4e8160f..a2f8117 100644
--- a/sklearn/tree/_tree.pyx
+++ b/sklearn/tree/_tree.pyx
@@ -68,9 +68,9 @@ cdef SIZE_t INITIAL_STACK_SIZE = 10
 # Repeat struct definition for numpy
 NODE_DTYPE = np.dtype({
     'names': ['left_child', 'right_child', 'feature', 'threshold', 'impurity',
-              'n_node_samples', 'weighted_n_node_samples'],
+              'n_node_samples', 'weighted_n_node_samples', 'foo'],
     'formats': [np.intp, np.intp, np.intp, np.float64, np.float64, np.intp,
-                np.float64],
+                np.float64, np.intp],
     'offsets': [
         <Py_ssize_t> &(<Node*> NULL).left_child,
         <Py_ssize_t> &(<Node*> NULL).right_child,
@@ -78,7 +78,8 @@ NODE_DTYPE = np.dtype({
         <Py_ssize_t> &(<Node*> NULL).threshold,
         <Py_ssize_t> &(<Node*> NULL).impurity,
         <Py_ssize_t> &(<Node*> NULL).n_node_samples,
-        <Py_ssize_t> &(<Node*> NULL).weighted_n_node_samples
+        <Py_ssize_t> &(<Node*> NULL).weighted_n_node_samples,
+        <Py_ssize_t> &(<Node*> NULL).foo
     ]
 })

From tevang3 at gmail.com Sun Oct 2 08:23:50 2016
From: tevang3 at gmail.com (Thomas Evangelidis)
Date: Sun, 2 Oct 2016 14:23:50 +0200
Subject: [scikit-learn] suggested machine learning algorithm
In-Reply-To:
References:
Message-ID:

On 1 October 2016 at 20:48, Alexey Dral wrote:

> Hi Thomas,
>
> What quality do you have on training?
>
> There is no silver bullet, but there is quite common technique you can use
> to find out if you use appropriate algorithm. You can take a look at the
> difference between "train" and "validation" quality of learning curves
> (example).
> If you see big gap, then you can reduce complexity of your model to
> overcome overfitting (reduce interaction parameter / number of variables
> / iterations / ...). If you see a small gap, then you can try to increase
> model complexity to fit your data better.

Hi Alexey,

are the "Training examples" in the learning curves the number of observations used for training? Don't you think my dataset is kind of small (42 observations) to use that technique?

> Moreover, I see you have a tiny dataset and use 50/50 split. I presume,
> that you will train "production" model on the whole available dataset. In
> that case, I suggest you to use more data for training and use almost LOO
> approach
> to better estimate your predictive quality. But, be really cautious about
> cross-validation as you can easily overfit your data.

From aadral at gmail.com Sun Oct 2 18:52:39 2016
From: aadral at gmail.com (Alexey Dral)
Date: Sun, 2 Oct 2016 23:52:39 +0100
Subject: [scikit-learn] suggested machine learning algorithm
In-Reply-To:
References:
Message-ID:

2016-10-02 13:23 GMT+01:00 Thomas Evangelidis:

> On 1 October 2016 at 20:48, Alexey Dral wrote:
>
>> Hi Thomas,
>>
>> What quality do you have on training?
>>
>> There is no silver bullet, but there is quite common technique you can
>> use to find out if you use appropriate algorithm. You can take a look at
>> the difference between "train" and "validation" quality of learning curves
>> (example).
>> If you see big gap, then you can reduce complexity of your model to
>> overcome overfitting (reduce interaction parameter / number of variables
>> / iterations / ...). If you see a small gap, then you can try to increase
>> model complexity to fit your data better.
>
> Hi Alexey,
>
> the "Training examples" in the learning curves are the number of
> observations used for training? Don't you think my dataset is kind of small
> (42 observations) to use that technique?

Yes, it is really a tiny dataset =). You don't necessarily need to plot the curves over the number of training observations. For instance, you can plot them over the number of iterations.

>> Moreover, I see you have a tiny dataset and use 50/50 split. I presume,
>> that you will train "production" model on the whole available dataset.
>> In that case, I suggest you to use more data for training and use almost
>> LOO
>> approach
>> to better estimate your predictive quality. But, be really cautious about
>> cross-validation as you can easily overfit your data.

--
Yours sincerely,
https://www.linkedin.com/in/alexey-dral
Alexey A. Dral

From jbbrown at kuhp.kyoto-u.ac.jp Mon Oct 3 00:05:13 2016
From: jbbrown at kuhp.kyoto-u.ac.jp (Brown J.B.)
Date: Mon, 3 Oct 2016 13:05:13 +0900
Subject: [scikit-learn] ANN Scikit-learn 0.18 released
In-Reply-To: <77756fd4-5ad9-e51d-51e1-aa60274d2117@gmail.com>
References: <40a62931-bf5a-7c59-7253-69418d53f196@gmail.com> <20160929052856.GA1123098@phare.normalesup.org> <77756fd4-5ad9-e51d-51e1-aa60274d2117@gmail.com>
Message-ID:

Hello community,

Congratulations on the release of 0.18! While I'm merely a casual user and wish I could contribute more often, I thank everyone for their time and efforts!

2016-10-01 1:58 GMT+09:00 Andreas Mueller:

> We've got a lot in the works already for 0.19.
>
> * multiple metrics for cross validation (#7388 et al.)

I've done something like this in my internal model building and selection libraries. My solution has been to have
- each metric object be able to explain a "distance from optimal"
- a metric collection object, which can be built by either explicit instantiation or calculation using data
- a pareto curve calculation object
- a ranker for the points on the pareto curve, with the ability to select the N-best points.

While there are certainly smarter interfaces and implementations, here is an example of one of my doctests that may help get this PR started. My apologies that my old docstring argument notation doesn't match the commonly used standards.

Hope this helps,
J.B. Brown
Kyoto University

class TrialRanker(object):
    """An object for handling the generic mechanism of selecting optimal
    trials from a collection of trials."""

    def SelectBest(self, metricSets, paretoAlg,
                   preProcessor=None):
        """Select the best [metricSets] by using the
        [paretoAlg] pareto selection object.  Note that it is actually
        the [paretoAlg] that specifies how many optimal [metricSets] to
        select.

        Data may be pre-processed into a form necessary for the [paretoAlg]
        by using the [preProcessor] that is a MetricSetConverter.

        Return: an EvaluatedMetricSet if [paretoAlg] selects only one
        metric set, otherwise a list of EvaluatedMetricSet objects.

        >>> from pareto.paretoDecorators import MinNormSelector
        >>> from pareto import OriginBasePareto
        >>> pAlg = MinNormSelector(OriginBasePareto())

        >>> from metrics.TwoClassMetrics import Accuracy, Sensitivity
        >>> from metrics.metricSet import EvaluatedMetricSet
        >>> met1 = EvaluatedMetricSet.BuildByExplicitValue(
        ...     [(Accuracy, 0.7), (Sensitivity, 0.9)])
        >>> met1.SetTitle("Example1")
        >>> met1.associatedData = range(5)  # property set/get
        >>> met2 = EvaluatedMetricSet.BuildByExplicitValue(
        ...     [(Accuracy, 0.8), (Sensitivity, 0.6)])
        >>> met2.SetTitle("Example2")
        >>> met2.SetAssociatedData("abcdef")  # explicit method call
        >>> met3 = EvaluatedMetricSet.BuildByExplicitValue(
        ...     [(Accuracy, 0.5), (Sensitivity, 0.5)])
        >>> met3.SetTitle("Example3")
        >>> met3.associatedData = float

        >>> from metrics.metricSet.converters import OptDistConverter

        >>> ranker = TrialRanker()  # pAlg selects met1
        >>> best = ranker.SelectBest((met1,met2,met3),
        ...                          pAlg, OptDistConverter())
        >>> best.VerboseDescription(True)
        >>> str(best)
        'Example1: 2 metrics; Accuracy=0.700; Sensitivity=0.900'
        >>> best.associatedData
        [0, 1, 2, 3, 4]

        >>> pAlg = MinNormSelector(OriginBasePareto(), nSelect=2)
        >>> best = ranker.SelectBest((met1,met2,met3),
        ...                          pAlg, OptDistConverter())
        >>> for metSet in best:
        ...     metSet.VerboseDescription(True)
        ...     str(metSet)
        ...     str(metSet.associatedData)
        'Example1: 2 metrics; Accuracy=0.700; Sensitivity=0.900'
        '[0, 1, 2, 3, 4]'
        'Example2: 2 metrics; Accuracy=0.800; Sensitivity=0.600'
        'abcdef'

        >>> from metrics.TwoClassMetrics import PositivePredictiveValue
        >>> met4 = EvaluatedMetricSet.BuildByExplicitValue(
        ...     [(Accuracy, 0.7), (PositivePredictiveValue, 0.5)])
        >>> best = ranker.SelectBest((met1,met2,met3,met4),
        ...                          pAlg, OptDistConverter())
        Traceback (most recent call last):
        ...
        ValueError: Metric sets contain differing Metrics.

From Victor.Poughon at cnes.fr Mon Oct 3 05:21:24 2016
From: Victor.Poughon at cnes.fr (Poughon Victor)
Date: Mon, 3 Oct 2016 09:21:24 +0000
Subject: [scikit-learn] sample_weight for cohen_kappa_score
Message-ID: <3E55146A6A81B44A9CB69CAB65908CEA3558C867@TW-MBX-P01.cnesnet.ad.cnes.fr>

Hello,

I'd like to use sample weights together with sklearn.metrics.cohen_kappa_score, in a similar way to other metrics which have this argument. Is it as simple as forwarding the weights to the confusion_matrix call? [0]

If yes I'm happy to work on the pull request.

In that case the other argument "weights" might be confusing but it's too late to rename it, right?

Cheers,

Victor Poughon

[0] https://github.com/scikit-learn/scikit-learn/blob/dee786a/sklearn/metrics/classification.py#L331

From klonuo at gmail.com Mon Oct 3 07:30:44 2016
From: klonuo at gmail.com (klo uo)
Date: Mon, 3 Oct 2016 13:30:44 +0200
Subject: [scikit-learn] Generate data from trained naive bayes
Message-ID:

Hi,

because naive Bayes is a generative model, does that mean that I can somehow generate data based on the trained model?

For example:

clf = BernoulliNB()
clf.fit(train, labels)

Can I generate data for a specific label?

Thanks,
Klo

From t3kcit at gmail.com Mon Oct 3 09:07:56 2016
From: t3kcit at gmail.com (Andreas Mueller)
Date: Mon, 3 Oct 2016 09:07:56 -0400
Subject: [scikit-learn] Generate data from trained naive bayes
In-Reply-To:
References:
Message-ID: <476b1ca4-c7fa-89ec-81da-c42f9e7abb69@gmail.com>

Hi Klo.
Yes, you could, but as the model is very simple, that's usually not very interesting. It stores for each label an independent Bernoulli distribution for each feature. These are stored in feature_log_prob_.
I would suggest you look at this attribute, rather than sample from the distribution. To sample from it you would have to exponentiate it and then sample from these Bernoulli distributions.

Andy

From t3kcit at gmail.com Mon Oct 3 09:09:54 2016
From: t3kcit at gmail.com (Andreas Mueller)
Date: Mon, 3 Oct 2016 09:09:54 -0400
Subject: [scikit-learn] sample_weight for cohen_kappa_score
In-Reply-To: <3E55146A6A81B44A9CB69CAB65908CEA3558C867@TW-MBX-P01.cnesnet.ad.cnes.fr>
References: <3E55146A6A81B44A9CB69CAB65908CEA3558C867@TW-MBX-P01.cnesnet.ad.cnes.fr>
Message-ID: <71ddf2ec-cf8a-c6bd-3134-cd1bc7f5e360@gmail.com>

Hm, it sounds like "weights" should have been called "weighting" maybe? Not sure if it's worth changing now, as we released it already.

And I think passing the sample weights to the confusion matrix is correct. There should be tests for weighted metrics to confirm that.

PR welcome.

From klonuo at gmail.com Mon Oct 3 11:08:39 2016
From: klonuo at gmail.com (klo uo)
Date: Mon, 3 Oct 2016 17:08:39 +0200
Subject: [scikit-learn] Generate data from trained naive bayes
In-Reply-To: <476b1ca4-c7fa-89ec-81da-c42f9e7abb69@gmail.com>
References: <476b1ca4-c7fa-89ec-81da-c42f9e7abb69@gmail.com>
Message-ID:

Thanks Andy,

I can follow you up to the point "...and then sample from these Bernoulli distributions".

From the data in feature_log_prob_, I would guess it contains a single feature vector (the feature means from the training data) for each class.

I can see how can I sample from feature_log_prob_...

From klonuo at gmail.com Mon Oct 3 11:09:32 2016
From: klonuo at gmail.com (klo uo)
Date: Mon, 3 Oct 2016 17:09:32 +0200
Subject: [scikit-learn] Generate data from trained naive bayes
In-Reply-To:
References: <476b1ca4-c7fa-89ec-81da-c42f9e7abb69@gmail.com>
Message-ID:

On Mon, Oct 3, 2016 at 5:08 PM, klo uo wrote:

> I can see how can I sample from feature_log_prob_...

I meant I cannot see

From gael.varoquaux at normalesup.org Mon Oct 3 11:14:15 2016
From: gael.varoquaux at normalesup.org (Gael Varoquaux)
Date: Mon, 3 Oct 2016 17:14:15 +0200
Subject: [scikit-learn] Welcome Raghav to the core-dev team
Message-ID: <20161003151415.GF20745@phare.normalesup.org>

Hi,

We have the pleasure to welcome Raghav RV to the core-dev team. Raghav (@raghavrv) has been working on scikit-learn for more than a year. In particular, he implemented the rewrite of the cross-validation utilities, which is quite dear to my heart.

Welcome Raghav!

Gaël

From manojkumarsivaraj334 at gmail.com Mon Oct 3 11:23:55 2016
From: manojkumarsivaraj334 at gmail.com (Manoj Kumar)
Date: Mon, 3 Oct 2016 11:23:55 -0400
Subject: [scikit-learn] Generate data from trained naive bayes
In-Reply-To:
References: <476b1ca4-c7fa-89ec-81da-c42f9e7abb69@gmail.com>
Message-ID:

Hi,

feature_log_prob_ is an array of size (n_classes, n_features).
exp(feature_log_prob_[class_ind, feature_ind]) gives P(X_{feature_ind} = 1 | class_ind)" Using the conditional independence assumptions of NaiveBayes, you can use this to sample each feature independently given the class. Hope that helps. On Mon, Oct 3, 2016 at 11:09 AM, klo uo wrote: > On Mon, Oct 3, 2016 at 5:08 PM, klo uo wrote: > >> I can see how can I sample from feature_log_prob_... >> > > I meant I cannot see > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -- Manoj, http://github.com/MechCoder -------------- next part -------------- An HTML attachment was scrubbed... URL: From nfliu at uw.edu Mon Oct 3 11:22:33 2016 From: nfliu at uw.edu (Nelson Liu) Date: Mon, 3 Oct 2016 08:22:33 -0700 Subject: [scikit-learn] Welcome Raghav to the core-dev team In-Reply-To: <20161003151415.GF20745@phare.normalesup.org> References: <20161003151415.GF20745@phare.normalesup.org> Message-ID: Yay! Congrats, Raghav! On Mon, Oct 3, 2016 at 8:14 AM, Gael Varoquaux < gael.varoquaux at normalesup.org> wrote: > Hi, > > We have the pleasure to welcome Raghav RV to the core-dev team. Raghav > (@raghavrv) has been working on scikit-learn for more than a year. In > particular, he implemented the rewrite of the cross-validation utilities, > which is quite dear to my heart. > > Welcome Raghav! > > Ga?l > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From ashu.9412 at gmail.com Mon Oct 3 11:27:40 2016 From: ashu.9412 at gmail.com (Devashish Deshpande) Date: Mon, 3 Oct 2016 20:57:40 +0530 Subject: [scikit-learn] Welcome Raghav to the core-dev team In-Reply-To: References: <20161003151415.GF20745@phare.normalesup.org> Message-ID: Congratulations Raghav!! 

From yenchenlin1994 at gmail.com Mon Oct 3 11:28:58 2016
From: yenchenlin1994 at gmail.com (lin yenchen)
Date: Mon, 03 Oct 2016 15:28:58 +0000
Subject: [scikit-learn] Welcome Raghav to the core-dev team
In-Reply-To:
References: <20161003151415.GF20745@phare.normalesup.org>
Message-ID:

Congrats, Raghav!

From krishnakalyan3 at gmail.com Mon Oct 3 11:39:22 2016
From: krishnakalyan3 at gmail.com (Krishna Kalyan)
Date: Mon, 3 Oct 2016 17:39:22 +0200
Subject: [scikit-learn] Welcome Raghav to the core-dev team
In-Reply-To:
References: <20161003151415.GF20745@phare.normalesup.org>
Message-ID:

Congrats Raghav. :)

From ronnie.ghose at gmail.com Mon Oct 3 11:40:15 2016
From: ronnie.ghose at gmail.com (Ronnie Ghose)
Date: Mon, 3 Oct 2016 11:40:15 -0400
Subject: [scikit-learn] Welcome Raghav to the core-dev team
In-Reply-To:
References: <20161003151415.GF20745@phare.normalesup.org>
Message-ID:

congrats! :)

From ragvrv at gmail.com Mon Oct 3 12:09:13 2016
From: ragvrv at gmail.com (Raghav R V)
Date: Mon, 3 Oct 2016 18:09:13 +0200
Subject: [scikit-learn] Welcome Raghav to the core-dev team
In-Reply-To:
References: <20161003151415.GF20745@phare.normalesup.org>
Message-ID:

Thanks everyone! Looking forward to contributing more :D

From nelle.varoquaux at gmail.com Mon Oct 3 12:21:51 2016
From: nelle.varoquaux at gmail.com (Nelle Varoquaux)
Date: Mon, 3 Oct 2016 09:21:51 -0700
Subject: [scikit-learn] Welcome Raghav to the core-dev team
In-Reply-To:
References: <20161003151415.GF20745@phare.normalesup.org>
Message-ID:

Congratulation Raghav!

From ragvrv at gmail.com Mon Oct 3 12:23:35 2016
From: ragvrv at gmail.com (Raghav R V)
Date: Mon, 3 Oct 2016 18:23:35 +0200
Subject: [scikit-learn] ANN Scikit-learn 0.18 released
In-Reply-To:
References: <40a62931-bf5a-7c59-7253-69418d53f196@gmail.com> <20160929052856.GA1123098@phare.normalesup.org> <77756fd4-5ad9-e51d-51e1-aa60274d2117@gmail.com>
Message-ID:

Hi Brown,

Thanks for the email. There is a working PR at https://github.com/scikit-learn/scikit-learn/pull/7388

Would you be kind enough to take a look at it and comment on how helpful the proposed API is for your use case?

Thanks

On Mon, Oct 3, 2016 at 6:05 AM, Brown J.B. wrote:
> Hello community,
>
> Congratulations on the release of 0.18!
> While I'm merely a casual user and wish I could contribute more often, I thank everyone for their time and efforts!
>
> 2016-10-01 1:58 GMT+09:00 Andreas Mueller :
>
>> We've got a lot in the works already for 0.19.
>>>
>>> * multiple metrics for cross validation (#7388 et al.)
>
> I've done something like this in my internal model building and selection libraries.
> My solution has been to have
> - each metric object be able to explain a "distance from optimal"
> - a metric collection object, which can be built by either explicit instantiation or calculation using data
> - a pareto curve calculation object
> - a ranker for the points on the pareto curve, with the ability to select the N-best points.
>
> While there are certainly smarter interfaces and implementations, here is an example of one of my doctests that may help get this PR started.
> My apologies that my old docstring argument notation doesn't match the commonly used standards.
>
> Hope this helps,
> J.B. Brown
> Kyoto University
>
> class TrialRanker(object):
>     """An object for handling the generic mechanism of selecting optimal
>     trials from a collection of trials."""
>
>     ...
>
>     def SelectBest(self, metricSets, paretoAlg,
>                    preProcessor=None):
>         """Select the best [metricSets] by using the
>         [paretoAlg] pareto selection object.  Note that it is actually
>         the [paretoAlg] that specifies how many optimal [metricSets] to
>         select.
>
>         Data may be pre-processed into a form necessary for the [paretoAlg]
>         by using the [preProcessor] that is a MetricSetConverter.
>
>         Return: an EvaluatedMetricSet if [paretoAlg] selects only one
>         metric set, otherwise a list of EvaluatedMetricSet objects.
>
>         >>> from pareto.paretoDecorators import MinNormSelector
>         >>> from pareto import OriginBasePareto
>         >>> pAlg = MinNormSelector(OriginBasePareto())
>
>         >>> from metrics.TwoClassMetrics import Accuracy, Sensitivity
>         >>> from metrics.metricSet import EvaluatedMetricSet
>         >>> met1 = EvaluatedMetricSet.BuildByExplicitValue(
>         ...     [(Accuracy, 0.7), (Sensitivity, 0.9)])
>         >>> met1.SetTitle("Example1")
>         >>> met1.associatedData = range(5)  # property set/get
>         >>> met2 = EvaluatedMetricSet.BuildByExplicitValue(
>         ...     [(Accuracy, 0.8), (Sensitivity, 0.6)])
>         >>> met2.SetTitle("Example2")
>         >>> met2.SetAssociatedData("abcdef")  # explicit method call
>         >>> met3 = EvaluatedMetricSet.BuildByExplicitValue(
>         ...     [(Accuracy, 0.5), (Sensitivity, 0.5)])
>         >>> met3.SetTitle("Example3")
>         >>> met3.associatedData = float
>
>         >>> from metrics.metricSet.converters import OptDistConverter
>
>         >>> ranker = TrialRanker()  # pAlg selects met1
>         >>> best = ranker.SelectBest((met1, met2, met3),
>         ...                          pAlg, OptDistConverter())
>         >>> best.VerboseDescription(True)
>         >>> str(best)
>         'Example1: 2 metrics; Accuracy=0.700; Sensitivity=0.900'
>         >>> best.associatedData
>         [0, 1, 2, 3, 4]
>
>         >>> pAlg = MinNormSelector(OriginBasePareto(), nSelect=2)
>         >>> best = ranker.SelectBest((met1, met2, met3),
>         ...                          pAlg, OptDistConverter())
>         >>> for metSet in best:
>         ...     metSet.VerboseDescription(True)
>         ...     str(metSet)
>         ...     str(metSet.associatedData)
>         'Example1: 2 metrics; Accuracy=0.700; Sensitivity=0.900'
>         '[0, 1, 2, 3, 4]'
>         'Example2: 2 metrics; Accuracy=0.800; Sensitivity=0.600'
>         'abcdef'
>
>         >>> from metrics.TwoClassMetrics import PositivePredictiveValue
>         >>> met4 = EvaluatedMetricSet.BuildByExplicitValue(
>         ...     [(Accuracy, 0.7), (PositivePredictiveValue, 0.5)])
>         >>> best = ranker.SelectBest((met1, met2, met3, met4),
>         ...                          pAlg, OptDistConverter())
>         Traceback (most recent call last):
>         ...
>         ValueError: Metric sets contain differing Metrics.

From manojkumarsivaraj334 at gmail.com Mon Oct 3 12:24:05 2016
From: manojkumarsivaraj334 at gmail.com (Manoj Kumar)
Date: Mon, 3 Oct 2016 12:24:05 -0400
Subject: [scikit-learn] Welcome Raghav to the core-dev team
In-Reply-To:
References: <20161003151415.GF20745@phare.normalesup.org>
Message-ID:

Congratulations!

--
Manoj,
http://github.com/MechCoder

From aakash at klugtek.co.in Mon Oct 3 12:48:05 2016
From: aakash at klugtek.co.in (Aakash Agarwal)
Date: Mon, 3 Oct 2016 22:18:05 +0530
Subject: [scikit-learn] Welcome Raghav to the core-dev team
In-Reply-To:
References: <20161003151415.GF20745@phare.normalesup.org>
Message-ID:

Congrats Raghav!

--
Thanks,
Aakash

From siddharthgupta234 at gmail.com Mon Oct 3 12:53:19 2016
From: siddharthgupta234 at gmail.com (Siddharth Gupta)
Date: Mon, 3 Oct 2016 22:23:19 +0530
Subject: [scikit-learn] Welcome Raghav to the core-dev team
In-Reply-To:
References: <20161003151415.GF20745@phare.normalesup.org>
Message-ID:

Congrats Raghav! :D

From se.raschka at gmail.com Mon Oct 3 13:06:59 2016
From: se.raschka at gmail.com (Sebastian Raschka)
Date: Mon, 3 Oct 2016 13:06:59 -0400
Subject: [scikit-learn] Welcome Raghav to the core-dev team
In-Reply-To:
References: <20161003151415.GF20745@phare.normalesup.org>
Message-ID:

Congrats Raghav! And thanks a lot for all the great work on the model_selection module!

From jmschreiber91 at gmail.com Mon Oct 3 13:32:30 2016
From: jmschreiber91 at gmail.com (Jacob Schreiber)
Date: Mon, 3 Oct 2016 10:32:30 -0700
Subject: [scikit-learn] Welcome Raghav to the core-dev team
In-Reply-To:
References: <20161003151415.GF20745@phare.normalesup.org>
Message-ID:

Congrats Raghav!

From klonuo at gmail.com Mon Oct 3 13:45:29 2016
From: klonuo at gmail.com (klo uo)
Date: Mon, 3 Oct 2016 19:45:29 +0200
Subject: [scikit-learn] Generate data from trained naive bayes
In-Reply-To:
References: <476b1ca4-c7fa-89ec-81da-c42f9e7abb69@gmail.com>
Message-ID:

Hi Manoj,

thanks for your reply.

Sorry to say, but I don't understand how to generate new features.

In this example I have X with shape (1000, 64) and 5 unique classes. feature_log_prob_ has shape (5, 64).

I can generate, for example, uniform data with r = np.random.rand(64)

Now how can I generate new features, given a trained classifier?
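
The sampling idea discussed in this thread (each binary feature drawn independently given the class, with P(x_j = 1 | class) = exp(feature_log_prob_[class, j])) can be sketched in plain Python. This is only an illustration under stated assumptions: the model is taken to be a Bernoulli naive Bayes, `sample_given_class` is a hypothetical helper name, and `toy_log_prob` is a hand-written stand-in for the `feature_log_prob_` array of a fitted classifier, not real data.

```python
import math
import random


def sample_given_class(feature_log_prob, class_ind, n_samples, seed=0):
    """Draw binary feature vectors for one class of a Bernoulli naive Bayes model.

    feature_log_prob is indexed as [class][feature] and holds
    log P(x_j = 1 | class), like feature_log_prob_ on a fitted model.
    """
    rng = random.Random(seed)
    # Convert log-probabilities back to probabilities for the chosen class.
    probs = [math.exp(lp) for lp in feature_log_prob[class_ind]]
    # Features are conditionally independent given the class, so each one
    # is an independent Bernoulli draw with its own parameter.
    return [[1 if rng.random() < p else 0 for p in probs]
            for _ in range(n_samples)]


# Hypothetical stand-in for clf.feature_log_prob_ (2 classes, 3 features).
toy_log_prob = [
    [math.log(0.9), math.log(0.1), math.log(0.5)],
    [math.log(0.2), math.log(0.7), math.log(0.5)],
]

samples = sample_given_class(toy_log_prob, class_ind=0, n_samples=2000)
col_means = [sum(col) / len(samples) for col in zip(*samples)]
print(col_means)  # each mean should land near 0.9, 0.1 and 0.5
```

With a fitted classifier one would presumably pass its `feature_log_prob_` array in place of `toy_log_prob`; `class_ind` is then a row index into that array, not necessarily the class label itself.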

From manojkumarsivaraj334 at gmail.com Mon Oct 3 14:20:09 2016
From: manojkumarsivaraj334 at gmail.com (Manoj Kumar)
Date: Mon, 3 Oct 2016 14:20:09 -0400
Subject: [scikit-learn] Generate data from trained naive bayes
In-Reply-To:
References: <476b1ca4-c7fa-89ec-81da-c42f9e7abb69@gmail.com>
Message-ID:

Let's say you would like to generate just the first feature of 1000 samples with label 0.

The distribution of the first feature conditioned on label 0 follows a Bernoulli distribution (as suggested by the name) with parameter "exp(feature_log_prob_[0, 0])". You could then generate the first feature of these 1000 samples by just doing

first_feature = bernoulli.rvs(exp(feature_log_prob_[0, 0]), size=1000)

And follow the same approach for all the other features with the corresponding parameters. (They are conditionally independent.)

From cs14btech11041 at iith.ac.in Mon Oct 3 14:25:51 2016
From: cs14btech11041 at iith.ac.in (Ibrahim Dalal)
Date: Mon, 3 Oct 2016 23:55:51 +0530
Subject: [scikit-learn] Random Forest with Bootstrapping
Message-ID:

Dear Developers,

From whatever little knowledge I gained last night about Random Forests, each tree is trained with a sub-sample of the original dataset (usually with replacement).

(Note: Please do correct me if I am not making any sense.)

RandomForestClassifier has an option of 'bootstrap'. The API states the following

> The sub-sample size is always the same as the original input sample size
> but the samples are drawn with replacement if bootstrap=True (default).

Now, what I am not able to understand is: if the entire dataset is used to train each of the trees, then how does the classifier estimate the OOB error? None of the entries of the dataset is an oob for any of the trees. (Pardon me if all this sounds BS)

Help this mere mortal.

Thanks
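
The documentation sentence quoted in the question above can be checked numerically: a sample of size n drawn with replacement has the same size as the dataset but does not contain every row. The sketch below is plain Python, not scikit-learn's implementation, and `out_of_bag_fraction` is a hypothetical helper name. It simulates one bootstrap draw and compares the share of rows left out with the value (1 - 1/n)^n, which tends to 1/e (about 0.368) as n grows.

```python
import math
import random


def out_of_bag_fraction(n_samples, seed=0):
    """Fraction of rows never drawn in one bootstrap sample of size n_samples."""
    rng = random.Random(seed)
    # Draw n_samples row indices with replacement, as a bootstrap sample does;
    # the set keeps only the distinct indices that were drawn at least once.
    drawn = {rng.randrange(n_samples) for _ in range(n_samples)}
    return 1.0 - len(drawn) / n_samples


n = 10000
simulated = out_of_bag_fraction(n)
# Probability that a given row is missed by all n draws: (1 - 1/n)^n -> 1/e.
theoretical = (1.0 - 1.0 / n) ** n
print(simulated, theoretical, math.exp(-1))
```

Roughly a third of the rows are therefore unseen by any single tree, and those rows are what an out-of-bag estimate can use for that tree.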

From se.raschka at gmail.com Mon Oct 3 14:32:52 2016
From: se.raschka at gmail.com (Sebastian Raschka)
Date: Mon, 3 Oct 2016 14:32:52 -0400
Subject: [scikit-learn] Random Forest with Bootstrapping
In-Reply-To:
References:
Message-ID:

> From whatever little knowledge I gained last night about Random Forests, each tree is trained with a sub-sample of the original dataset (usually with replacement).

Yes, that should be correct!

> Now, what I am not able to understand is: if the entire dataset is used to train each of the trees, then how does the classifier estimate the OOB error? None of the entries of the dataset is an oob for any of the trees. (Pardon me if all this sounds BS)

If you take an n-size bootstrap sample, where n is the number of samples in your dataset, you have asymptotically 0.632 * n unique samples in your bootstrap set. Or in other words, 0.368 * n samples are not used for growing the respective tree (and can be used to compute the OOB). As far as I understand, the random forest OOB score is then computed as the average OOB of each tree (correct me if I am wrong!).

Best,
Sebastian
From aadral at gmail.com Mon Oct 3 14:34:04 2016
From: aadral at gmail.com (Алексей Драль)
Date: Mon, 3 Oct 2016 19:34:04 +0100
Subject: [scikit-learn] Random Forest with Bootstrapping
In-Reply-To:
References:
Message-ID:

Hi,

From the docs, http://scikit-learn.org/stable/auto_examples/ensemble/plot_ensemble_oob.html :

> The RandomForestClassifier is trained using bootstrap aggregation, where
> each new tree is fit from a bootstrap sample of the training observations
> z_i = (x_i, y_i). The out-of-bag (OOB) error is the average error for each
> z_i calculated using predictions from the trees that do not contain z_i in
> their respective bootstrap sample. This allows the RandomForestClassifier
> to be fit and validated whilst being trained [1].

If you sample with replacement, there is a high chance that some z_i will not be included in the training set of a given tree. That tree is then used in estimating the OOB error for z_i.

I hope this makes it a little bit clearer.

--
Yours sincerely,
https://www.linkedin.com/in/alexey-dral
Alexey A. Dral

From desitter.gravity at gmail.com Mon Oct 3 14:36:11 2016
From: desitter.gravity at gmail.com (desitter.gravity at gmail.com)
Date: Mon, 3 Oct 2016 11:36:11 -0700
Subject: [scikit-learn] Welcome Raghav to the core-dev team
In-Reply-To:
References: <20161003151415.GF20745@phare.normalesup.org>
Message-ID:

Excellent Raghav! Open Source Rules the World!

On Mon, Oct 3, 2016 at 8:14 AM, Gael Varoquaux wrote:
> Hi,
>
> We have the pleasure to welcome Raghav RV to the core-dev team. Raghav
> (@raghavrv) has been working on scikit-learn for more than a year. In
> particular, he implemented the rewrite of the cross-validation utilities,
> which is quite dear to my heart.
>
> Welcome Raghav!
>
> Gaël
From zephyr14 at gmail.com Mon Oct 3 15:04:03 2016
From: zephyr14 at gmail.com (Vlad Niculae)
Date: Mon, 3 Oct 2016 15:04:03 -0400
Subject: [scikit-learn] Welcome Raghav to the core-dev team
In-Reply-To:
References: <20161003151415.GF20745@phare.normalesup.org>
Message-ID:

Awesome! Congrats Raghav and thank you for all your contributions!
From cs14btech11041 at iith.ac.in Mon Oct 3 15:05:55 2016
From: cs14btech11041 at iith.ac.in (Ibrahim Dalal)
Date: Tue, 4 Oct 2016 00:35:55 +0530
Subject: [scikit-learn] Random Forest with Bootstrapping
In-Reply-To:
References:
Message-ID:

Hi,

Thank you for the reply. Please bear with me for a while.

Where did this number, 0.632, come from? I have no background in statistics (which appears to be the case here!). Or let me rephrase my query: what is this bootstrap sampling all about? I searched the web, but didn't get satisfactory results.

Thanks
From se.raschka at gmail.com Mon Oct 3 15:15:18 2016
From: se.raschka at gmail.com (Sebastian Raschka)
Date: Mon, 3 Oct 2016 15:15:18 -0400
Subject: [scikit-learn] Random Forest with Bootstrapping
In-Reply-To:
References:
Message-ID: <18DFDA25-0236-486C-B23D-4E1118EC4803@gmail.com>

Say the probability that a given sample from a dataset of size n is *not* drawn as a bootstrap sample is

P(not_chosen) = (1 - 1/n)^n

This is because each draw has a 1/n chance of picking a particular sample, and so a (1 - 1/n) chance of missing it. Since bootstrapping draws with replacement, the n draws that make up an n-sized bootstrap sample are independent, and the miss probabilities multiply.

This is asymptotically 1/e, approx. 0.368 (i.e., for very, very large n).

Then, you can compute the probability of a sample being chosen as

P(chosen) = 1 - (1 - 1/n)^n approx. 0.632

Best,
Sebastian
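The 0.632 figure can also be checked numerically. The sketch below is only an illustration, with assumed names (`unique_fraction`, `exact_probability`) and an arbitrary choice of n = 1000 and 200 repeats: it compares the closed form 1 - (1 - 1/n)^n, the limit 1 - 1/e, and the average fraction of distinct indices seen in simulated bootstrap samples.

```python
import math
import random


def unique_fraction(n, rng):
    """Draw one bootstrap sample of size n (with replacement) and return
    the fraction of the n original indices that appear at least once."""
    drawn = {rng.randrange(n) for _ in range(n)}
    return len(drawn) / n


def exact_probability(n):
    """P(chosen) = 1 - (1 - 1/n)^n for a dataset of size n."""
    return 1.0 - (1.0 - 1.0 / n) ** n


rng = random.Random(0)  # fixed seed so the run is repeatable
n = 1000
repeats = 200
simulated = sum(unique_fraction(n, rng) for _ in range(repeats)) / repeats

print("closed form for n = 1000:", exact_probability(n))
print("limit 1 - 1/e           :", 1.0 - 1.0 / math.e)
print("simulated average       :", simulated)
```

The three printed values should agree to about two decimal places. For small n the closed form sits somewhat above the limit, which is why 0.632 is quoted as an asymptotic figure.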
From klonuo at gmail.com Mon Oct 3 15:18:30 2016
From: klonuo at gmail.com (klo uo)
Date: Mon, 3 Oct 2016 21:18:30 +0200
Subject: [scikit-learn] Generate data from trained naive bayes
In-Reply-To:
References: <476b1ca4-c7fa-89ec-81da-c42f9e7abb69@gmail.com>
Message-ID:

Great. Thanks for your time Manoj

Cheers,
Klo
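For readers following the naive Bayes thread, Manoj's recipe (one independent Bernoulli draw per feature, with parameter exp(feature_log_prob_[c, j]) for class c and feature j) can be sketched as below. This is a sketch under stated assumptions, not the library's own sampling routine: `feature_log_prob` is a hand-written stand-in for a fitted BernoulliNB's `feature_log_prob_` attribute, the probabilities in it are invented, and `sample_features` is a hypothetical helper name.

```python
import math
import random

# Hypothetical stand-in for a fitted BernoulliNB's feature_log_prob_:
# row c, column j holds log P(feature j = 1 | class c). Values are invented.
feature_log_prob = [
    [math.log(0.9), math.log(0.2), math.log(0.5)],  # class 0
    [math.log(0.1), math.log(0.7), math.log(0.5)],  # class 1
]


def sample_features(label, size, rng):
    """Draw `size` binary feature vectors for the given class label.

    Each feature is drawn independently from its own Bernoulli distribution,
    which is the naive Bayes conditional-independence assumption.
    """
    probs = [math.exp(lp) for lp in feature_log_prob[label]]
    return [[int(rng.random() < p) for p in probs] for _ in range(size)]


rng = random.Random(0)  # fixed seed so the run is repeatable
samples = sample_features(label=0, size=1000, rng=rng)
first_feature_mean = sum(row[0] for row in samples) / len(samples)
print("empirical P(feature 0 = 1 | class 0):", first_feature_mean)
```

With SciPy available, the inner draw is what `scipy.stats.bernoulli.rvs(p, size=1000)` does in Manoj's one-liner. The empirical mean printed above should land close to the 0.9 used for class 0, feature 0.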
From se.raschka at gmail.com Mon Oct 3 15:20:06 2016
From: se.raschka at gmail.com (Sebastian Raschka)
Date: Mon, 3 Oct 2016 15:20:06 -0400
Subject: [scikit-learn] Random Forest with Bootstrapping
In-Reply-To: <18DFDA25-0236-486C-B23D-4E1118EC4803@gmail.com>
References: <18DFDA25-0236-486C-B23D-4E1118EC4803@gmail.com>
Message-ID: <4977A46A-2064-42EE-8853-4E4C799776A3@gmail.com>

Or maybe more intuitively, you can visualize this asymptotic behavior, e.g., via

    import matplotlib.pyplot as plt

    # P(chosen) = 1 - (1 - 1/n)^n for n = 5, 10, ..., 200
    vs = []
    for n in range(5, 201, 5):
        v = 1 - (1. - 1./n)**n
        vs.append(v)

    # the curve flattens out near 0.632 as n grows
    plt.plot([n for n in range(5, 201, 5)], vs, marker='o',
             markersize=6,
             alpha=0.5,)

    plt.xlabel('n')
    plt.ylabel('1 - (1 - 1/n)^n')
    plt.xlim([0, 210])
    plt.show()
From t3kcit at gmail.com Mon Oct 3 15:25:54 2016
From: t3kcit at gmail.com (Andreas Mueller)
Date: Mon, 3 Oct 2016 15:25:54 -0400
Subject: [scikit-learn] Welcome Raghav to the core-dev team
In-Reply-To:
References: <20161003151415.GF20745@phare.normalesup.org>
Message-ID: <3d961045-39d0-8c81-1deb-2f6b7332ff1e@gmail.com>

Congrats, hope to see lots more ;)

On 10/03/2016 12:09 PM, Raghav R V wrote:
> Thanks everyone! Looking forward to contributing more :D
From cs14btech11041 at iith.ac.in Mon Oct 3 15:36:54 2016
From: cs14btech11041 at iith.ac.in (Ibrahim Dalal)
Date: Tue, 4 Oct 2016 01:06:54 +0530
Subject: [scikit-learn] Random Forest with Bootstrapping
In-Reply-To: <4977A46A-2064-42EE-8853-4E4C799776A3@gmail.com>
References: <18DFDA25-0236-486C-B23D-4E1118EC4803@gmail.com> <4977A46A-2064-42EE-8853-4E4C799776A3@gmail.com>
Message-ID:

Hi,

That helped a lot. Thank you very much. I have one more (silly?) doubt though.

Won't an n-sized bootstrapped sample have repeated entries? Say we have an original dataset of size 100. A bootstrap sample (say, B) of size 100 is drawn from this set. Since 32 of the original samples are left out (theoretically at least), some of the samples in B must be repeated?
From se.raschka at gmail.com Mon Oct 3 15:59:38 2016
From: se.raschka at gmail.com (Sebastian Raschka)
Date: Mon, 3 Oct 2016 15:59:38 -0400
Subject: [scikit-learn] Random Forest with Bootstrapping
In-Reply-To:
References: <18DFDA25-0236-486C-B23D-4E1118EC4803@gmail.com> <4977A46A-2064-42EE-8853-4E4C799776A3@gmail.com>
Message-ID:

> Hi,
>
> That helped a lot. Thank you very much. I have one more (silly?) doubt though.
>
> Won't an n-sized bootstrapped sample have repeated entries? Say we have an original dataset of size 100. A bootstrap sample (say, B) of size 100 is drawn from this set. Since 32 of the original samples are left out (theoretically at least), some of the samples in B must be repeated?

Yeah, you'll definitely have duplications; that's why (if you have an infinitely large n) only 0.632*n samples are unique ;).

Say your dataset is

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9] (where the numbers represent the indices of your data points)

then a bootstrap sample could be

[9, 1, 1, 0, 4, 4, 5, 7, 9, 9]

and your left-out sample is consequently

[2, 3, 6, 8]
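The index example above can be reproduced in a few lines. This is a small sketch with an assumed helper name (`out_of_bag`): it recovers the left-out indices for the hand-written bootstrap sample and then draws a fresh bootstrap sample with a seeded generator to show the same bookkeeping on random draws.

```python
import random


def out_of_bag(n, bootstrap_indices):
    """Return the sorted indices in range(n) that never appear in the bootstrap sample."""
    return sorted(set(range(n)) - set(bootstrap_indices))


# The hand-written example from the message above.
example = [9, 1, 1, 0, 4, 4, 5, 7, 9, 9]
print(out_of_bag(10, example))  # [2, 3, 6, 8]

# A freshly drawn bootstrap sample: n draws with replacement from range(n).
rng = random.Random(42)  # fixed seed so the run is repeatable
n = 10
sample = [rng.randrange(n) for _ in range(n)]
print("bootstrap sample:", sample)
print("left out        :", out_of_bag(n, sample))
```

Every index ends up either in the bootstrap sample (possibly several times) or in the left-out list, never in both, so the two sets together always cover the whole dataset.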
None of the entries of the dataset is an oob for any of the trees. (Pardon me if all this sounds BS) > >>> > >>> Help this mere mortal. > >>> > >>> Thanks > >>> _______________________________________________ > >>> scikit-learn mailing list > >>> scikit-learn at python.org > >>> https://mail.python.org/mailman/listinfo/scikit-learn > >> > >> _______________________________________________ > >> scikit-learn mailing list > >> scikit-learn at python.org > >> https://mail.python.org/mailman/listinfo/scikit-learn > >> > >> _______________________________________________ > >> scikit-learn mailing list > >> scikit-learn at python.org > >> https://mail.python.org/mailman/listinfo/scikit-learn > > > > _______________________________________________ > > scikit-learn mailing list > > scikit-learn at python.org > > https://mail.python.org/mailman/listinfo/scikit-learn > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn From cs14btech11041 at iith.ac.in Mon Oct 3 16:03:52 2016 From: cs14btech11041 at iith.ac.in (Ibrahim Dalal) Date: Tue, 4 Oct 2016 01:33:52 +0530 Subject: [scikit-learn] Random Forest with Bootstrapping In-Reply-To: References: <18DFDA25-0236-486C-B23D-4E1118EC4803@gmail.com> <4977A46A-2064-42EE-8853-4E4C799776A3@gmail.com> Message-ID: So what is the point of having duplicate entries in your training set? This seems just a pure overhead. Sorry but you will again have to help me here. On Tue, Oct 4, 2016 at 1:29 AM, Sebastian Raschka wrote: > > Hi, > > > > That helped a lot. Thank you very much. I have one more (silly?) doubt > though. > > > > Won't an n-sized bootstrapped sample have repeated entries? Say we have > an original dataset of size 100. 
A bootstrap sample (say, B) of size 100 is > drawn from this set. Since 32 of the original samples are left out > (theoretically at least), some of the samples in B must be repeated? > > Yeah, you'll definitely have duplications, that?s why (if you have an > infinitely large n) only 0.632*n samples are unique ;). > > Say your dataset is > > [0, 1, 2, 3, 4, 5, 6, 7, 8, 9] (where the numbers represent the indices of > your data points) > > then a bootstrap sample could be > > [9, 1, 1, 0, 4, 4, 5, 7, 9, 9] and your left out sample is consequently > [2, 3, 6, 8] > > > > On Oct 3, 2016, at 3:36 PM, Ibrahim Dalal via scikit-learn < > scikit-learn at python.org> wrote: > > > > Hi, > > > > That helped a lot. Thank you very much. I have one more (silly?) doubt > though. > > > > Won't an n-sized bootstrapped sample have repeated entries? Say we have > an original dataset of size 100. A bootstrap sample (say, B) of size 100 is > drawn from this set. Since 32 of the original samples are left out > (theoretically at least), some of the samples in B must be repeated? > > > > On Tue, Oct 4, 2016 at 12:50 AM, Sebastian Raschka > wrote: > > Or maybe more intuitively, you can visualize this asymptotic behavior > e.g., via > > > > import matplotlib.pyplot as plt > > > > vs = [] > > for n in range(5, 201, 5): > > v = 1 - (1. - 1./n)**n > > vs.append(v) > > > > plt.plot([n for n in range(5, 201, 5)], vs, marker='o', > > markersize=6, > > alpha=0.5,) > > > > plt.xlabel('n') > > plt.ylabel('1 - (1 - 1/n)^n') > > plt.xlim([0, 210]) > > plt.show() > > > > > On Oct 3, 2016, at 3:15 PM, Sebastian Raschka > wrote: > > > > > > Say the probability that a given sample from a dataset of size n is > *not* drawn as a bootstrap sample is > > > > > > P(not_chosen) = (1 - 1\n)^n > > > > > > Since you have a 1/n chance to draw a particular sample (since > bootstrapping involves drawing with replacement), which you repeat n times > to get a n-sized bootstrap sample. 
> > > > > > This is asymptotically "1/e approx. 0.368? (i.e., for very, very large > n) > > > > > > Then, you can compute the probability of a sample being chosen as > > > > > > P(chosen) = 1 - (1 - 1/n)^n approx. 0.632 > > > > > > Best, > > > Sebastian > > > > > >> On Oct 3, 2016, at 3:05 PM, Ibrahim Dalal via scikit-learn < > scikit-learn at python.org> wrote: > > >> > > >> Hi, > > >> > > >> Thank you for the reply. Please bear with me for a while. > > >> > > >> From where did this number, 0.632, come? I have no background in > statistics (which appears to be the case here!). Or let me rephrase my > query: what is this bootstrap sampling all about? Searched the web, but > didn't get satisfactory results. > > >> > > >> > > >> Thanks > > >> > > >> On Tue, Oct 4, 2016 at 12:02 AM, Sebastian Raschka < > se.raschka at gmail.com> wrote: > > >>> From whatever little knowledge I gained last night about Random > Forests, each tree is trained with a sub-sample of original dataset > (usually with replacement)?. > > >> > > >> Yes, that should be correct! > > >> > > >>> Now, what I am not able to understand is - if entire dataset is used > to train each of the trees, then how does the classifier estimates the OOB > error? None of the entries of the dataset is an oob for any of the trees. > (Pardon me if all this sounds BS) > > >> > > >> If you take an n-size bootstrap sample, where n is the number of > samples in your dataset, you have asymptotically 0.632 * n unique samples > in your bootstrap set. Or in other words 0.368 * n samples are not used for > growing the respective tree (to compute the OOB). As far as I understand, > the random forest OOB score is then computed as the average OOB of each tee > (correct me if I am wrong!). 
> > >> > > >> Best, > > >> Sebastian > > >> > > >>> On Oct 3, 2016, at 2:25 PM, Ibrahim Dalal via scikit-learn < > scikit-learn at python.org> wrote: > > >>> > > >>> Dear Developers, > > >>> > > >>> From whatever little knowledge I gained last night about Random > Forests, each tree is trained with a sub-sample of original dataset > (usually with replacement)?. > > >>> > > >>> (Note: Please do correct me if I am not making any sense.) > > >>> > > >>> RandomForestClassifier has an option of 'bootstrap'. The API states > the following > > >>> > > >>> The sub-sample size is always the same as the original input sample > size but the samples are drawn with replacement if bootstrap=True (default). > > >>> > > >>> Now, what I am not able to understand is - if entire dataset is used > to train each of the trees, then how does the classifier estimates the OOB > error? None of the entries of the dataset is an oob for any of the trees. > (Pardon me if all this sounds BS) > > >>> > > >>> Help this mere mortal. 
> > >>> > > >>> Thanks > > >>> _______________________________________________ > > >>> scikit-learn mailing list > > >>> scikit-learn at python.org > > >>> https://mail.python.org/mailman/listinfo/scikit-learn > > >> > > >> _______________________________________________ > > >> scikit-learn mailing list > > >> scikit-learn at python.org > > >> https://mail.python.org/mailman/listinfo/scikit-learn > > >> > > >> _______________________________________________ > > >> scikit-learn mailing list > > >> scikit-learn at python.org > > >> https://mail.python.org/mailman/listinfo/scikit-learn > > > > > > _______________________________________________ > > > scikit-learn mailing list > > > scikit-learn at python.org > > > https://mail.python.org/mailman/listinfo/scikit-learn > > > > _______________________________________________ > > scikit-learn mailing list > > scikit-learn at python.org > > https://mail.python.org/mailman/listinfo/scikit-learn > > > > _______________________________________________ > > scikit-learn mailing list > > scikit-learn at python.org > > https://mail.python.org/mailman/listinfo/scikit-learn > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From se.raschka at gmail.com Mon Oct 3 16:28:36 2016 From: se.raschka at gmail.com (Sebastian Raschka) Date: Mon, 3 Oct 2016 16:28:36 -0400 Subject: [scikit-learn] Random Forest with Bootstrapping In-Reply-To: References: <18DFDA25-0236-486C-B23D-4E1118EC4803@gmail.com> <4977A46A-2064-42EE-8853-4E4C799776A3@gmail.com> Message-ID: Originally, it was this technique was used to estimate a sampling distribution. Think of the drawing with replacement as work-around for generating *new* data from a population that is simulated by this repeated sampling from the given dataset with replacement. 
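A minimal sketch of that idea, using only the Python standard library: it resamples a small toy dataset with replacement many times and uses the spread of the resampled means as an estimate of the standard error of the mean. The toy values and the names `bootstrap_means`, `n_rounds`, and `seed` are made up for illustration and are not part of scikit-learn.

```python
import random
import statistics


def bootstrap_means(data, n_rounds=2000, seed=0):
    """Return the means of n_rounds bootstrap samples drawn from data."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    n = len(data)
    means = []
    for _ in range(n_rounds):
        # Draw n items *with replacement*; duplicates are expected.
        sample = rng.choices(data, k=n)
        means.append(statistics.mean(sample))
    return means


# Toy dataset (hypothetical values, for illustration only).
data = [2.1, 3.4, 1.9, 5.0, 4.2, 3.3, 2.8, 4.7, 3.9, 3.1]
means = bootstrap_means(data)

# The spread of the bootstrap means approximates the standard error of the mean.
boot_se = statistics.stdev(means)
# Textbook estimate for comparison: s / sqrt(n).
formula_se = statistics.stdev(data) / len(data) ** 0.5
print(f"bootstrap SE: {boot_se:.3f}, formula SE: {formula_se:.3f}")
```

With only 10 points the bootstrap figure tends to come out a little below the s / sqrt(n) value, since resampling treats the dataset itself as the population.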
For more details, I'd recommend reading the original literature, e.g.,

Efron, Bradley. 1979. "Bootstrap Methods: Another Look at the Jackknife." The Annals of Statistics 7 (1). Institute of Mathematical Statistics: 1-26.

There's also a whole book on this topic:

Efron, Bradley, and Robert Tibshirani. 1994. An Introduction to the Bootstrap. Chapman & Hall.

Or more relevant to this particular application, maybe see

Breiman, L., 1996. Bagging predictors. Machine learning, 24(2), pp.123-140.

"Tests on real and simulated data sets using classification and regression trees and subset selection in linear regression show that bagging can give substantial gains in accuracy. The vital element is the instability of the prediction method. If perturbing the learning set can cause significant changes in the predictor constructed, then bagging can improve accuracy."

> On Oct 3, 2016, at 4:03 PM, Ibrahim Dalal via scikit-learn wrote:
>
> So what is the point of having duplicate entries in your training set? This seems just a pure overhead. Sorry but you will again have to help me here.

From jaquesgrobler at gmail.com Tue Oct 4 04:14:54 2016
From: jaquesgrobler at gmail.com (Jaques Grobler)
Date: Tue, 4 Oct 2016 10:14:54 +0200
Subject: [scikit-learn] Welcome Raghav to the core-dev team
In-Reply-To:
<3d961045-39d0-8c81-1deb-2f6b7332ff1e@gmail.com> References: <20161003151415.GF20745@phare.normalesup.org> <3d961045-39d0-8c81-1deb-2f6b7332ff1e@gmail.com> Message-ID: Congrats Raghav! 2016-10-03 21:25 GMT+02:00 Andreas Mueller : > Congrats, hope to see lot's more ;) > > > On 10/03/2016 12:09 PM, Raghav R V wrote: > > Thanks everyone! Looking forward to contributing more :D > > On Mon, Oct 3, 2016 at 5:40 PM, Ronnie Ghose > wrote: > >> congrats! :) >> >> On Mon, Oct 3, 2016 at 11:28 AM, lin yenchen >> wrote: >> >>> Congrats, Raghav! >>> >>> Nelson Liu ? 2016?10?3? ?? ??11:27??? >>> >>>> Yay! Congrats, Raghav! >>>> >>>> On Mon, Oct 3, 2016 at 8:14 AM, Gael Varoquaux < >>>> gael.varoquaux at normalesup.org> wrote: >>>> >>>> Hi, >>>> >>>> We have the pleasure to welcome Raghav RV to the core-dev team. Raghav >>>> (@raghavrv) has been working on scikit-learn for more than a year. In >>>> particular, he implemented the rewrite of the cross-validation >>>> utilities, >>>> which is quite dear to my heart. >>>> >>>> Welcome Raghav! 
>>>> >>>> Ga?l >>>> >>>> _______________________________________________ >>>> scikit-learn mailing list >>>> scikit-learn at python.org >>>> https://mail.python.org/mailman/listinfo/scikit-learn >>>> >>>> >>>> _______________________________________________ >>>> scikit-learn mailing list >>>> scikit-learn at python.org >>>> https://mail.python.org/mailman/listinfo/scikit-learn >>>> >>> >>> _______________________________________________ >>> scikit-learn mailing list >>> scikit-learn at python.org >>> https://mail.python.org/mailman/listinfo/scikit-learn >>> >>> >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> > > > _______________________________________________ > scikit-learn mailing listscikit-learn at python.orghttps://mail.python.org/mailman/listinfo/scikit-learn > > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From Victor.Poughon at cnes.fr Tue Oct 4 05:10:04 2016 From: Victor.Poughon at cnes.fr (Poughon Victor) Date: Tue, 4 Oct 2016 09:10:04 +0000 Subject: [scikit-learn] sample_weight for cohen_kappa_score In-Reply-To: <71ddf2ec-cf8a-c6bd-3134-cd1bc7f5e360@gmail.com> References: <3E55146A6A81B44A9CB69CAB65908CEA3558C867@TW-MBX-P01.cnesnet.ad.cnes.fr>, <71ddf2ec-cf8a-c6bd-3134-cd1bc7f5e360@gmail.com> Message-ID: <3E55146A6A81B44A9CB69CAB65908CEA3558E06E@TW-MBX-P01.cnesnet.ad.cnes.fr> I had a go at a PR (with a caveat for testing): https://github.com/scikit-learn/scikit-learn/pull/7569 Victor Poughon ________________________________________ De : scikit-learn [scikit-learn-bounces+victor.poughon=cnes.fr at python.org] de la part de Andreas Mueller [t3kcit at gmail.com] Envoy? : lundi 3 octobre 2016 15:09 ? 
: Scikit-learn user and developer mailing list Objet : Re: [scikit-learn] sample_weight for cohen_kappa_score Hm it sounds like "weights" should have been called "weighting" maybe? Not sure if it's worth changing now, as we released it already. And I think passing the weighting to the confusion matrix is correct. There should be tests for weighted metrics to confirm that. PR welcome. On 10/03/2016 05:21 AM, Poughon Victor wrote: > Hello, > > I'd like to use samples weights together with sklearn.metrics.cohen_kappa_score, > in a similar way to other metrics which have this argument. Is it as simple as > forwarding the weights to the confusion_matrix call? [0] > > If yes I'm happy to work on the pull request. > > In that case the other argument "weights" might be confusing but it's too late > to rename it, right? > > Cheers, > > Victor Poughon > > [0] https://github.com/scikit-learn/scikit-learn/blob/dee786a/sklearn/metrics/classification.py#L331 > > > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn _______________________________________________ scikit-learn mailing list scikit-learn at python.org https://mail.python.org/mailman/listinfo/scikit-learn From joel.nothman at gmail.com Tue Oct 4 06:43:02 2016 From: joel.nothman at gmail.com (Joel Nothman) Date: Tue, 4 Oct 2016 21:43:02 +1100 Subject: [scikit-learn] Welcome Raghav to the core-dev team In-Reply-To: References: <20161003151415.GF20745@phare.normalesup.org> <3d961045-39d0-8c81-1deb-2f6b7332ff1e@gmail.com> Message-ID: Congratulations, Raghav! Thanks for your dedication, as a student and mentor in GSoC, but at all other times too! On 4 October 2016 at 19:14, Jaques Grobler wrote: > Congrats Raghav! > > 2016-10-03 21:25 GMT+02:00 Andreas Mueller : > >> Congrats, hope to see lot's more ;) >> >> >> On 10/03/2016 12:09 PM, Raghav R V wrote: >> >> Thanks everyone! 
Looking forward to contributing more :D >> >> On Mon, Oct 3, 2016 at 5:40 PM, Ronnie Ghose >> wrote: >> >>> congrats! :) >>> >>> On Mon, Oct 3, 2016 at 11:28 AM, lin yenchen >>> wrote: >>> >>>> Congrats, Raghav! >>>> >>>> Nelson Liu ? 2016?10?3? ?? ??11:27??? >>>> >>>>> Yay! Congrats, Raghav! >>>>> >>>>> On Mon, Oct 3, 2016 at 8:14 AM, Gael Varoquaux < >>>>> gael.varoquaux at normalesup.org> wrote: >>>>> >>>>> Hi, >>>>> >>>>> We have the pleasure to welcome Raghav RV to the core-dev team. Raghav >>>>> (@raghavrv) has been working on scikit-learn for more than a year. In >>>>> particular, he implemented the rewrite of the cross-validation >>>>> utilities, >>>>> which is quite dear to my heart. >>>>> >>>>> Welcome Raghav! >>>>> >>>>> Ga?l >>>>> >>>>> _______________________________________________ >>>>> scikit-learn mailing list >>>>> scikit-learn at python.org >>>>> https://mail.python.org/mailman/listinfo/scikit-learn >>>>> >>>>> >>>>> _______________________________________________ >>>>> scikit-learn mailing list >>>>> scikit-learn at python.org >>>>> https://mail.python.org/mailman/listinfo/scikit-learn >>>>> >>>> >>>> _______________________________________________ >>>> scikit-learn mailing list >>>> scikit-learn at python.org >>>> https://mail.python.org/mailman/listinfo/scikit-learn >>>> >>>> >>> >>> _______________________________________________ >>> scikit-learn mailing list >>> scikit-learn at python.org >>> https://mail.python.org/mailman/listinfo/scikit-learn >>> >>> >> >> >> _______________________________________________ >> scikit-learn mailing listscikit-learn at python.orghttps://mail.python.org/mailman/listinfo/scikit-learn >> >> >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > 
https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From cs14btech11041 at iith.ac.in Tue Oct 4 06:44:06 2016 From: cs14btech11041 at iith.ac.in (Ibrahim Dalal) Date: Tue, 4 Oct 2016 16:14:06 +0530 Subject: [scikit-learn] Random Forest with Bootstrapping In-Reply-To: References: <18DFDA25-0236-486C-B23D-4E1118EC4803@gmail.com> <4977A46A-2064-42EE-8853-4E4C799776A3@gmail.com> Message-ID: Hi, So why is using a bootstrap sample of size n better than just a random set of size 0.62*n in Random Forest? Thanks On Tue, Oct 4, 2016 at 1:58 AM, Sebastian Raschka wrote: > Originally, it was this technique was used to estimate a sampling > distribution. Think of the drawing with replacement as work-around for > generating *new* data from a population that is simulated by this repeated > sampling from the given dataset with replacement. > > > For more details, I?d recommend reading the original literature, e.g,. > > Efron, Bradley. 1979. ?Bootstrap Methods: Another Look at the Jackknife.? > The Annals of Statistics 7 (1). Institute of Mathematical Statistics: 1?26. > > > There?s also a whole book on this topic: > > Efron, Bradley, and Robert Tibshirani. 1994. An Introduction to the > Bootstrap. Chapman & Hall. > > > Or more relevant to this particular application, maybe see > > Breiman, L., 1996. Bagging predictors. Machine learning, 24(2), pp.123-140. > > "Tests on real and simulated data sets using classification and regression > trees and subset selection in linear regression show that bagging can give > substantial gains in accuracy. The vital element is the instability of the > prediction method. If perturbing the learning set can cause significant > changes in the predictor constructed, then bagging can improve accuracy." 
> > > > On Oct 3, 2016, at 4:03 PM, Ibrahim Dalal via scikit-learn < > scikit-learn at python.org> wrote: > > > > So what is the point of having duplicate entries in your training set? > This seems just a pure overhead. Sorry but you will again have to help me > here. > > > > On Tue, Oct 4, 2016 at 1:29 AM, Sebastian Raschka > wrote: > > > Hi, > > > > > > That helped a lot. Thank you very much. I have one more (silly?) doubt > though. > > > > > > Won't an n-sized bootstrapped sample have repeated entries? Say we > have an original dataset of size 100. A bootstrap sample (say, B) of size > 100 is drawn from this set. Since 32 of the original samples are left out > (theoretically at least), some of the samples in B must be repeated? > > > > Yeah, you'll definitely have duplications, that?s why (if you have an > infinitely large n) only 0.632*n samples are unique ;). > > > > Say your dataset is > > > > [0, 1, 2, 3, 4, 5, 6, 7, 8, 9] (where the numbers represent the indices > of your data points) > > > > then a bootstrap sample could be > > > > [9, 1, 1, 0, 4, 4, 5, 7, 9, 9] and your left out sample is consequently > [2, 3, 6, 8] > > > > > > > On Oct 3, 2016, at 3:36 PM, Ibrahim Dalal via scikit-learn < > scikit-learn at python.org> wrote: > > > > > > Hi, > > > > > > That helped a lot. Thank you very much. I have one more (silly?) doubt > though. > > > > > > Won't an n-sized bootstrapped sample have repeated entries? Say we > have an original dataset of size 100. A bootstrap sample (say, B) of size > 100 is drawn from this set. Since 32 of the original samples are left out > (theoretically at least), some of the samples in B must be repeated? > > > > > > On Tue, Oct 4, 2016 at 12:50 AM, Sebastian Raschka < > se.raschka at gmail.com> wrote: > > > Or maybe more intuitively, you can visualize this asymptotic behavior > e.g., via > > > > > > import matplotlib.pyplot as plt > > > > > > vs = [] > > > for n in range(5, 201, 5): > > > v = 1 - (1. 
- 1./n)**n > > > vs.append(v) > > > > > > plt.plot([n for n in range(5, 201, 5)], vs, marker='o', > > > markersize=6, > > > alpha=0.5,) > > > > > > plt.xlabel('n') > > > plt.ylabel('1 - (1 - 1/n)^n') > > > plt.xlim([0, 210]) > > > plt.show() > > > > > > > On Oct 3, 2016, at 3:15 PM, Sebastian Raschka > wrote: > > > > > > > > Say the probability that a given sample from a dataset of size n is > *not* drawn as a bootstrap sample is > > > > > > > > P(not_chosen) = (1 - 1\n)^n > > > > > > > > Since you have a 1/n chance to draw a particular sample (since > bootstrapping involves drawing with replacement), which you repeat n times > to get a n-sized bootstrap sample. > > > > > > > > This is asymptotically "1/e approx. 0.368? (i.e., for very, very > large n) > > > > > > > > Then, you can compute the probability of a sample being chosen as > > > > > > > > P(chosen) = 1 - (1 - 1/n)^n approx. 0.632 > > > > > > > > Best, > > > > Sebastian > > > > > > > >> On Oct 3, 2016, at 3:05 PM, Ibrahim Dalal via scikit-learn < > scikit-learn at python.org> wrote: > > > >> > > > >> Hi, > > > >> > > > >> Thank you for the reply. Please bear with me for a while. > > > >> > > > >> From where did this number, 0.632, come? I have no background in > statistics (which appears to be the case here!). Or let me rephrase my > query: what is this bootstrap sampling all about? Searched the web, but > didn't get satisfactory results. > > > >> > > > >> > > > >> Thanks > > > >> > > > >> On Tue, Oct 4, 2016 at 12:02 AM, Sebastian Raschka < > se.raschka at gmail.com> wrote: > > > >>> From whatever little knowledge I gained last night about Random > Forests, each tree is trained with a sub-sample of original dataset > (usually with replacement)?. > > > >> > > > >> Yes, that should be correct! > > > >> > > > >>> Now, what I am not able to understand is - if entire dataset is > used to train each of the trees, then how does the classifier estimates the > OOB error? 
None of the entries of the dataset is an oob for any of the > trees. (Pardon me if all this sounds BS) > > > >> > > > >> If you take an n-size bootstrap sample, where n is the number of > samples in your dataset, you have asymptotically 0.632 * n unique samples > in your bootstrap set. Or in other words 0.368 * n samples are not used for > growing the respective tree (to compute the OOB). As far as I understand, > the random forest OOB score is then computed as the average OOB of each tee > (correct me if I am wrong!). > > > >> > > > >> Best, > > > >> Sebastian > > > >> > > > >>> On Oct 3, 2016, at 2:25 PM, Ibrahim Dalal via scikit-learn < > scikit-learn at python.org> wrote: > > > >>> > > > >>> Dear Developers, > > > >>> > > > >>> From whatever little knowledge I gained last night about Random > Forests, each tree is trained with a sub-sample of original dataset > (usually with replacement)?. > > > >>> > > > >>> (Note: Please do correct me if I am not making any sense.) > > > >>> > > > >>> RandomForestClassifier has an option of 'bootstrap'. The API > states the following > > > >>> > > > >>> The sub-sample size is always the same as the original input > sample size but the samples are drawn with replacement if bootstrap=True > (default). > > > >>> > > > >>> Now, what I am not able to understand is - if entire dataset is > used to train each of the trees, then how does the classifier estimates the > OOB error? None of the entries of the dataset is an oob for any of the > trees. (Pardon me if all this sounds BS) > > > >>> > > > >>> Help this mere mortal. 
> > > >>> > > > >>> Thanks > > > >>> _______________________________________________ > > > >>> scikit-learn mailing list > > > >>> scikit-learn at python.org > > > >>> https://mail.python.org/mailman/listinfo/scikit-learn > > > >> > > > >> _______________________________________________ > > > >> scikit-learn mailing list > > > >> scikit-learn at python.org > > > >> https://mail.python.org/mailman/listinfo/scikit-learn > > > >> > > > >> _______________________________________________ > > > >> scikit-learn mailing list > > > >> scikit-learn at python.org > > > >> https://mail.python.org/mailman/listinfo/scikit-learn > > > > > > > > _______________________________________________ > > > > scikit-learn mailing list > > > > scikit-learn at python.org > > > > https://mail.python.org/mailman/listinfo/scikit-learn > > > > > > _______________________________________________ > > > scikit-learn mailing list > > > scikit-learn at python.org > > > https://mail.python.org/mailman/listinfo/scikit-learn > > > > > > _______________________________________________ > > > scikit-learn mailing list > > > scikit-learn at python.org > > > https://mail.python.org/mailman/listinfo/scikit-learn > > > > _______________________________________________ > > scikit-learn mailing list > > scikit-learn at python.org > > https://mail.python.org/mailman/listinfo/scikit-learn > > > > _______________________________________________ > > scikit-learn mailing list > > scikit-learn at python.org > > https://mail.python.org/mailman/listinfo/scikit-learn > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... 
From Dale.T.Smith at macys.com Tue Oct 4 08:15:41 2016
From: Dale.T.Smith at macys.com (Dale T Smith)
Date: Tue, 4 Oct 2016 12:15:41 +0000
Subject: [scikit-learn] Random Forest with Bootstrapping
In-Reply-To:
References: <18DFDA25-0236-486C-B23D-4E1118EC4803@gmail.com> <4977A46A-2064-42EE-8853-4E4C799776A3@gmail.com>
Message-ID:

Search for Jackknife at Wikipedia. That will give you a quick overview. Then you will have the background to read the papers below. While you are at Wikipedia, you may want to read on the bootstrap and random forests as well.

__________________________________________________________________________________________
Dale Smith | Macy's Systems and Technology | IFS eCommerce | Data Science
770-658-5176 | 5985 State Bridge Road, Johns Creek, GA 30097 | dale.t.smith at macys.com

From: scikit-learn [mailto:scikit-learn-bounces+dale.t.smith=macys.com at python.org] On Behalf Of Ibrahim Dalal via scikit-learn
Sent: Tuesday, October 4, 2016 6:44 AM
To: Scikit-learn user and developer mailing list
Cc: Ibrahim Dalal
Subject: Re: [scikit-learn] Random Forest with Bootstrapping

Hi,

So why is using a bootstrap sample of size n better than just a random set of size 0.632*n in Random Forest?

Thanks

On Tue, Oct 4, 2016 at 1:58 AM, Sebastian Raschka wrote:

Originally, this technique was used to estimate a sampling distribution. Think of the drawing with replacement as a work-around for generating *new* data from a population that is simulated by this repeated sampling from the given dataset with replacement.

For more details, I'd recommend reading the original literature, e.g.,

Efron, Bradley. 1979. "Bootstrap Methods: Another Look at the Jackknife." The Annals of Statistics 7 (1). Institute of Mathematical Statistics: 1-26.

There's also a whole book on this topic:

Efron, Bradley, and Robert Tibshirani. 1994. An Introduction to the Bootstrap. Chapman & Hall.

Or more relevant to this particular application, maybe see

Breiman, L., 1996.
Bagging predictors. Machine learning, 24(2), pp.123-140.

"Tests on real and simulated data sets using classification and regression trees and subset selection in linear regression show that bagging can give substantial gains in accuracy. The vital element is the instability of the prediction method. If perturbing the learning set can cause significant changes in the predictor constructed, then bagging can improve accuracy."

> On Oct 3, 2016, at 4:03 PM, Ibrahim Dalal via scikit-learn wrote:
>
> So what is the point of having duplicate entries in your training set? This seems just a pure overhead. Sorry but you will again have to help me here.
>
> On Tue, Oct 4, 2016 at 1:29 AM, Sebastian Raschka wrote:
> > Hi,
> >
> > That helped a lot. Thank you very much. I have one more (silly?) doubt though.
> >
> > Won't an n-sized bootstrapped sample have repeated entries? Say we have an original dataset of size 100. A bootstrap sample (say, B) of size 100 is drawn from this set. Since about 37 of the original samples are left out (theoretically at least), some of the samples in B must be repeated?
>
> Yeah, you'll definitely have duplications, that's why (if you have an infinitely large n) only 0.632*n samples are unique ;).
>
> Say your dataset is
>
> [0, 1, 2, 3, 4, 5, 6, 7, 8, 9] (where the numbers represent the indices of your data points)
>
> then a bootstrap sample could be
>
> [9, 1, 1, 0, 4, 4, 5, 7, 9, 9] and your left out sample is consequently [2, 3, 6, 8]
>
>
> > On Oct 3, 2016, at 3:36 PM, Ibrahim Dalal via scikit-learn wrote:
> >
> > Hi,
> >
> > That helped a lot. Thank you very much. I have one more (silly?) doubt though.
> >
> > Won't an n-sized bootstrapped sample have repeated entries? Say we have an original dataset of size 100. A bootstrap sample (say, B) of size 100 is drawn from this set. Since about 37 of the original samples are left out (theoretically at least), some of the samples in B must be repeated?
> >
> > On Tue, Oct 4, 2016 at 12:50 AM, Sebastian Raschka wrote:
> > Or maybe more intuitively, you can visualize this asymptotic behavior e.g., via
> >
> > import matplotlib.pyplot as plt
> >
> > vs = []
> > for n in range(5, 201, 5):
> >     v = 1 - (1. - 1./n)**n
> >     vs.append(v)
> >
> > plt.plot([n for n in range(5, 201, 5)], vs, marker='o',
> >          markersize=6,
> >          alpha=0.5,)
> >
> > plt.xlabel('n')
> > plt.ylabel('1 - (1 - 1/n)^n')
> > plt.xlim([0, 210])
> > plt.show()
> >
> > > On Oct 3, 2016, at 3:15 PM, Sebastian Raschka wrote:
> > >
> > > Say the probability that a given sample from a dataset of size n is *not* drawn as a bootstrap sample is
> > >
> > > P(not_chosen) = (1 - 1/n)^n
> > >
> > > You have a 1/n chance of drawing a particular sample on each draw (bootstrapping draws with replacement), so the chance of missing it on one draw is 1 - 1/n, and you repeat the draw n times to get an n-sized bootstrap sample.
> > >
> > > This is asymptotically 1/e, approx. 0.368 (i.e., for very, very large n)
> > >
> > > Then, you can compute the probability of a sample being chosen as
> > >
> > > P(chosen) = 1 - (1 - 1/n)^n approx. 0.632
> > >
> > > Best,
> > > Sebastian
> > >
> > >> On Oct 3, 2016, at 3:05 PM, Ibrahim Dalal via scikit-learn wrote:
> > >>
> > >> Hi,
> > >>
> > >> Thank you for the reply. Please bear with me for a while.
> > >>
> > >> From where did this number, 0.632, come? I have no background in statistics (which appears to be the case here!). Or let me rephrase my query: what is this bootstrap sampling all about? Searched the web, but didn't get satisfactory results.
> > >>
> > >>
> > >> Thanks
> > >>
> > >> On Tue, Oct 4, 2016 at 12:02 AM, Sebastian Raschka wrote:
> > >>> From whatever little knowledge I gained last night about Random Forests, each tree is trained with a sub-sample of original dataset (usually with replacement).
> > >>
> > >> Yes, that should be correct!
> > >>
> > >>> Now, what I am not able to understand is - if entire dataset is used to train each of the trees, then how does the classifier estimate the OOB error? None of the entries of the dataset is an oob for any of the trees. (Pardon me if all this sounds BS)
> > >>
> > >> If you take an n-size bootstrap sample, where n is the number of samples in your dataset, you have asymptotically 0.632 * n unique samples in your bootstrap set. Or in other words 0.368 * n samples are not used for growing the respective tree (to compute the OOB). As far as I understand, the random forest OOB score is then computed as the average OOB of each tree (correct me if I am wrong!).
> > >>
> > >> Best,
> > >> Sebastian
> > >>
> > >>> On Oct 3, 2016, at 2:25 PM, Ibrahim Dalal via scikit-learn wrote:
> > >>>
> > >>> Dear Developers,
> > >>>
> > >>> From whatever little knowledge I gained last night about Random Forests, each tree is trained with a sub-sample of original dataset (usually with replacement).
> > >>>
> > >>> (Note: Please do correct me if I am not making any sense.)
> > >>>
> > >>> RandomForestClassifier has an option of 'bootstrap'. The API states the following
> > >>>
> > >>> The sub-sample size is always the same as the original input sample size but the samples are drawn with replacement if bootstrap=True (default).
> > >>>
> > >>> Now, what I am not able to understand is - if entire dataset is used to train each of the trees, then how does the classifier estimate the OOB error? None of the entries of the dataset is an oob for any of the trees. (Pardon me if all this sounds BS)
> > >>>
> > >>> Help this mere mortal.
> > >>>
> > >>> Thanks
-------------- next part --------------
An HTML attachment was scrubbed...
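The 0.632 figure discussed in this thread can also be checked by simulation. What follows is a minimal sketch and not code from the thread: the function names `unique_fraction`, `mean_unique_fraction`, and `expected_unique_fraction` are made up for illustration, and only the Python standard library is assumed. It draws n-sized bootstrap samples with replacement and compares the observed share of distinct data points with the closed form 1 - (1 - 1/n)^n.

```python
import math
import random


def unique_fraction(n, rng):
    """Draw one n-sized bootstrap sample (with replacement) from indices
    0..n-1 and return the fraction of distinct indices that were drawn."""
    drawn = {rng.randrange(n) for _ in range(n)}
    return len(drawn) / n


def mean_unique_fraction(n, repeats=200, seed=0):
    """Average the unique fraction over several independent bootstrap samples."""
    rng = random.Random(seed)
    return sum(unique_fraction(n, rng) for _ in range(repeats)) / repeats


def expected_unique_fraction(n):
    """Closed form from the thread: P(chosen) = 1 - (1 - 1/n)^n."""
    return 1.0 - (1.0 - 1.0 / n) ** n


if __name__ == "__main__":
    for n in (10, 100, 1000):
        simulated = mean_unique_fraction(n)
        exact = expected_unique_fraction(n)
        print(n, round(simulated, 3), round(exact, 3))
    # Limit for very large n: 1 - 1/e
    print("limit", round(1.0 - 1.0 / math.e, 3))
```

The indices that never land in `drawn` are the ones left out of that sample, which is what makes an out-of-bag estimate possible even though every bootstrap sample has the same size as the dataset.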
From urvesh.patel11 at gmail.com Tue Oct 4 17:39:32 2016
From: urvesh.patel11 at gmail.com (urvesh patel)
Date: Tue, 4 Oct 2016 14:39:32 -0700
Subject: [scikit-learn] Adding a function that Calculates Weight of Evidence and Information Value
Message-ID:

> I have been using R extensively until the last few months, when I started using Python. I noticed that Python doesn't have a function to compute information value and weight of evidence. Detailed explanation - http://multithreaded.stitchfix.com/blog/2015/08/13/weight-of-evidence/
>
> I have version 0 of this concept ready and I would like to contribute to scikit-learn so that more and more people can use it. What are the steps I need to follow in order to do so?
>
> --
> Thanking You,
>
> Urvesh Patel
> Data Ninja
> Udacity

From blrstartuphire at gmail.com Wed Oct 5 05:58:03 2016
From: blrstartuphire at gmail.com (Startup Hire)
Date: Wed, 5 Oct 2016 15:28:03 +0530
Subject: [scikit-learn] Identifying column names of Non-zero values
Message-ID:

Hi Pypers,

Hope you are doing well.

I am working on a project to find out the column names of non-zero values at a row level.

How can this be done effectively in python pandas/dataframe?

For example,

Column1  Column 2  Column 3  Column 4  Column 5  Column 6  Column 7  New column to be created
1        1         1         0         0         0         1         Column1,Column 2,Column 3,Column7

I might have to do it on approximately a million rows

Regards,
Sanant
-------------- next part --------------
An HTML attachment was scrubbed...
From samo.turk at gmail.com Wed Oct 5 07:35:25 2016
From: samo.turk at gmail.com (Samo Turk)
Date: Wed, 5 Oct 2016 13:35:25 +0200
Subject: [scikit-learn] Identifying column names of Non-zero values
In-Reply-To:
References:
Message-ID:

Something like this might work:

def non_zero(row, columns):
    return list(columns[~(row == 0)])

df.apply(lambda x: non_zero(x, df.columns), axis=1)

Cheers,
Samo

On Wed, Oct 5, 2016 at 11:58 AM, Startup Hire wrote:
> Hi Pypers,
>
> Hope you are doing well.
>
> I am working on a project to find out the column names of non-zero values at a row level.
>
> How can this be done effectively in python pandas/dataframe?
>
> For example,
>
> Column1  Column 2  Column 3  Column 4  Column 5  Column 6  Column 7  New column to be created
> 1        1         1         0         0         0         1         Column1,Column 2,Column 3,Column7
>
> I might have to do it on approximately a million rows
>
> Regards,
> Sanant
-------------- next part --------------
An HTML attachment was scrubbed...
From maciek at wojcikowski.pl Wed Oct 5 07:53:28 2016
From: maciek at wojcikowski.pl (Maciek Wójcikowski)
Date: Wed, 5 Oct 2016 13:53:28 +0200
Subject: [scikit-learn] Identifying column names of Non-zero values
In-Reply-To:
References:
Message-ID:

Hi Sanant and Samo,

Even easier and faster solution:

> df.columns[(df.values != 0).any(axis=0)]

Or if for some reason != 0 does not work for you:

> df.columns[(~(df.values == 0)).any(axis=0)]

----
Pozdrawiam, | Best regards,
Maciek Wójcikowski
maciek at wojcikowski.pl

2016-10-05 13:35 GMT+02:00 Samo Turk:
> Something like this might work:
>
> def non_zero(row, columns):
>     return list(columns[~(row == 0)])
>
> df.apply(lambda x: non_zero(x, df.columns), axis=1)
>
> Cheers,
> Samo
>
> On Wed, Oct 5, 2016 at 11:58 AM, Startup Hire wrote:
>
>> Hi Pypers,
>>
>> Hope you are doing well.
>>
>> I am working on a project to find out the column names of non-zero values at a row level.
>>
>> How can this be done effectively in python pandas/dataframe?
>>
>> For example,
>>
>> Column1  Column 2  Column 3  Column 4  Column 5  Column 6  Column 7  New column to be created
>> 1        1         1         0         0         0         1         Column1,Column 2,Column 3,Column7
>>
>> I might have to do it on approximately a million rows
>>
>> Regards,
>> Sanant
-------------- next part --------------
An HTML attachment was scrubbed...
From blrstartuphire at gmail.com Wed Oct 5 08:13:19 2016
From: blrstartuphire at gmail.com (Startup Hire)
Date: Wed, 5 Oct 2016 17:43:19 +0530
Subject: [scikit-learn] Identifying column names of Non-zero values
In-Reply-To:
References:
Message-ID:

Hi Samo,

Thanks a lot.

It works at a row level and I can append it at a row level to the main dataframe to do further analysis.

Regards,
Sanant

On Wed, Oct 5, 2016 at 5:05 PM, Samo Turk wrote:
> Something like this might work:
>
> def non_zero(row, columns):
>     return list(columns[~(row == 0)])
>
> df.apply(lambda x: non_zero(x, df.columns), axis=1)
>
> Cheers,
> Samo
>
> On Wed, Oct 5, 2016 at 11:58 AM, Startup Hire wrote:
>
>> Hi Pypers,
>>
>> Hope you are doing well.
>>
>> I am working on a project to find out the column names of non-zero values at a row level.
>>
>> How can this be done effectively in python pandas/dataframe?
>>
>> For example,
>>
>> Column1  Column 2  Column 3  Column 4  Column 5  Column 6  Column 7  New column to be created
>> 1        1         1         0         0         0         1         Column1,Column 2,Column 3,Column7
>>
>> I might have to do it on approximately a million rows
>>
>> Regards,
>> Sanant
-------------- next part --------------
An HTML attachment was scrubbed...
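For the row-level result asked for in this thread, one possible approach is sketched below. It is an illustration under stated assumptions, not code posted by any participant: the helper name `nonzero_column_names`, the output column name `nonzero_columns`, the space-free column labels, and the second data row are all made up. It builds a boolean mask once with NumPy and joins the matching column names per row. Note that `(df.values != 0).any(axis=0)` reduces over rows, so it yields one set of columns for the whole frame rather than one per row.

```python
import pandas as pd


def nonzero_column_names(df, sep=","):
    """Return a Series that holds, for every row of df, the names of the
    columns whose value in that row is non-zero, joined by sep."""
    names = [str(c) for c in df.columns]
    mask = df.to_numpy() != 0  # 2-D boolean array, one row per DataFrame row
    joined = [
        sep.join(name for name, keep in zip(names, row) if keep)
        for row in mask
    ]
    return pd.Series(joined, index=df.index)


# Hypothetical data: the first row mirrors the example in the question.
df = pd.DataFrame(
    [[1, 1, 1, 0, 0, 0, 1],
     [0, 0, 0, 2, 0, 0, 0]],
    columns=["Column1", "Column2", "Column3", "Column4",
             "Column5", "Column6", "Column7"],
)
df["nonzero_columns"] = nonzero_column_names(df)
print(df["nonzero_columns"].tolist())
```

The mask is computed in one vectorized step, which is likely to matter more than the join itself on a frame with around a million rows, though that has not been measured here.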
From jiri.borovec at fel.cvut.cz Wed Oct 5 09:13:45 2016
From: jiri.borovec at fel.cvut.cz (Jiří Borovec)
Date: Wed, 5 Oct 2016 15:13:45 +0200
Subject: [scikit-learn] wrapper for GraphCut or GridCut
Message-ID:

Hello,
I was thinking about adding GraphCut (http://www.csd.uwo.ca/~yuri/Papers/pami01.pdf) or GridCut (http://www.gridcut.com/), both of which are already implemented in C/C++, and some of which also have Python wrappers. What is your position on this: including GraphCut in this library by using their C/C++ code and adding wrappers?

Thanks
--
Best regards, Jiri Borovec
------------------------------------------------------------------------
Ing. Jiri Borovec, MSc
PhD student at CMP CTU, http://cmp.felk.cvut.cz/~borovji3
-------------- next part --------------
An HTML attachment was scrubbed...

From t3kcit at gmail.com Wed Oct 5 11:08:56 2016
From: t3kcit at gmail.com (Andreas Mueller)
Date: Wed, 5 Oct 2016 11:08:56 -0400
Subject: [scikit-learn] wrapper for GraphCut or GridCut
In-Reply-To:
References:
Message-ID: <9536dd4e-151a-3a03-a422-1a3a6384b5fc@gmail.com>

Hi Jiri.
I think both are better suited for scikit-image.
I think Emmanuelle there is actually working on graph cut right now.
I'd ask on the scikit-image mailing list what the current status is.

Best,
Andy

On 10/05/2016 09:13 AM, Jiří Borovec wrote:
> Hello,
> I was thinking about adding GraphCut (http://www.csd.uwo.ca/~yuri/Papers/pami01.pdf) or GridCut (http://www.gridcut.com/), both of which are already implemented in C/C++, and some of which also have Python wrappers. What is your position on this: including GraphCut in this library by using their C/C++ code and adding wrappers?
>
> Thanks
> --
> Best regards, Jiri Borovec
> ------------------------------------------------------------------------
> Ing.
Jiri Borovec, MSc > > PhD student at CMP CTU, http://cmp.felk.cvut.cz/~borovji3 > > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn -------------- next part -------------- An HTML attachment was scrubbed... URL: From jiri.borovec at fel.cvut.cz Wed Oct 5 11:19:39 2016 From: jiri.borovec at fel.cvut.cz (=?UTF-8?B?SmnFmcOtIEJvcm92ZWM=?=) Date: Wed, 5 Oct 2016 17:19:39 +0200 Subject: [scikit-learn] wrapper for GraphCut or GridCut In-Reply-To: <9536dd4e-151a-3a03-a422-1a3a6384b5fc@gmail.com> References: <9536dd4e-151a-3a03-a422-1a3a6384b5fc@gmail.com> Message-ID: Hello, for the regular graph and GridCut ( https://github.com/willemolding/gridcut-python), meaning regular grid like image it would be better have it in skimage, but talking about general graph, I would keep in sklearn. I think that you already have a wrapper for GraphCut ( https://github.com/amueller/gco_python) even I found this ( https://github.com/yujiali/pygco) better one. -- Best regards, Jiri Borovec ------------------------------------------------------------------------ Ing. Jiri Borovec, MSc PhD student at CMP CTU, http://cmp.felk.cvut.cz/~borovji3 On 5 October 2016 at 17:08, Andreas Mueller wrote: > Hi Jiri. > I think both are better suited for scikit-image. > I think Emanuelle there is actually working on graph cut right now. > I'd ask on the scikit-image mailing list what the current status is. > > Best, > Andy > > On 10/05/2016 09:13 AM, Ji?? Borovec wrote: > > Hello, > I was thinking about adding GraphCut (http://www.csd.uwo.ca/~yuri/ > Papers/pami01.pdf) of GridCut (http://www.gridcut.com/) which both of > them are already implemented in C/C++ a some of then have also wrapper in > Python. What is the statement to this task, having GraphCut included in > this library such that using thier C/C++ code and include wrappers. 
> > Thanks > -- > Best regards, Jiri Borovec > ------------------------------------------------------------------------ > Ing. Jiri Borovec, MSc > PhD student at CMP CTU, http://cmp.felk.cvut.cz/~borovji3 > > > _______________________________________________ > scikit-learn mailing listscikit-learn at python.orghttps://mail.python.org/mailman/listinfo/scikit-learn > > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From t3kcit at gmail.com Wed Oct 5 11:19:36 2016 From: t3kcit at gmail.com (Andreas Mueller) Date: Wed, 5 Oct 2016 11:19:36 -0400 Subject: [scikit-learn] Adding a function that Calculates Weight of Evidence and Information Value In-Reply-To: References: Message-ID: <458656da-494a-eed0-cf38-347a146e987b@gmail.com> Hey Urvesh. That looks interesting. We recently added mutual information based feature selection. To add this to scikit-learn, we would like to see that this is an established method, for example via citations or forks or some other way. If it's only a year old (the date of the blog post) that might be a bit fresh for us, and you can add it to scikit-learn contrib. We would also like to see that there are cases when it works better than what is already established and what we have, like mutual info based selection. It looks like WOE is just the coefficient vector of Naive Bayes, right? I don't quite understand the information value at a glance, though. Andy On 10/04/2016 05:39 PM, urvesh patel wrote: > > > I have been using R extensively until last few months when I > started using Python. I noticed that Python doesn't have a > function to compute information value and weight of evidence. 
> Detailed explanation - > http://multithreaded.stitchfix.com/blog/2015/08/13/weight-of-evidence/ > > > I have version 0 of this concept ready and I would like to > contribute to scikit-learn so that more and more people can use > it. What are the steps I need to follow in order to do so ? > > -- > Thanking You, > > Urvesh Patel > Data Ninja > Udacity > > > > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn -------------- next part -------------- An HTML attachment was scrubbed... URL: From t3kcit at gmail.com Wed Oct 5 11:25:24 2016 From: t3kcit at gmail.com (Andreas Mueller) Date: Wed, 5 Oct 2016 11:25:24 -0400 Subject: [scikit-learn] wrapper for GraphCut or GridCut In-Reply-To: References: <9536dd4e-151a-3a03-a422-1a3a6384b5fc@gmail.com> Message-ID: <4509db43-1673-5b18-a77d-70e7f04042ea@gmail.com> On 10/05/2016 11:19 AM, Ji?? Borovec wrote: > Hello, > for the regular graph and GridCut > (https://github.com/willemolding/gridcut-python), meaning regular grid > like image it would be better have it in skimage, but talking about > general graph, I would keep in sklearn. I disagree. Why would it be in scikit-learn? It's not a learning algorithm. It doesn't have the same interface at all. It does something pretty unrelated to machine learning. And in vision, you often have other graphs if you work with superpixels. > I think that you already have a wrapper for GraphCut > (https://github.com/amueller/gco_python) even I found this > (https://github.com/yujiali/pygco) better one. > Cool. I did the minimal port for what I needed at the time. Since I was mostly interested in learning, I switched to using QPBO. Andy -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From urvesh.patel11 at gmail.com Wed Oct 5 11:41:35 2016 From: urvesh.patel11 at gmail.com (urvesh patel) Date: Wed, 5 Oct 2016 08:41:35 -0700 Subject: [scikit-learn] Adding a function that Calculates Weight of Evidence and Information Value In-Reply-To: <458656da-494a-eed0-cf38-347a146e987b@gmail.com> References: <458656da-494a-eed0-cf38-347a146e987b@gmail.com> Message-ID: Hi Andreas, You are correct about weight of evidence. Information Value is a fancy term but it is very similar to mutual information. Also, this method is used most widely with uplift random forest methodology or any incremental modeling problems where the goal is to find subset of population who will contribute to ROI goal over the users who would have purchased it anyways and over the users who have negative effect because of promotion. Citations for Information Value that I found - http://www.mwsug.org/proceedings/2013/AA/MWSUG-2013-AA14.pdf http://documentation.statsoft.com/STATISTICAHelp.aspx?path=WeightofEvidence/WeightofEvidenceWoEIntroductoryOverview More on Uplift Random Forest or Incremental Modeling - https://www.linkedin.com/pulse/need-more-lift-try-uplift-models-jeffrey-strickland-ph-d-cmsp PS - The function I have has a special flag for uplift modeling. If this flag is set, then Information value and weight of evidence are calculated accordingly. On Wed, Oct 5, 2016 at 8:19 AM, Andreas Mueller wrote: > Hey Urvesh. > That looks interesting. We recently added mutual information based feature > selection. > To add this to scikit-learn, we would like to see that this is an > established method, for example via citations > or forks or some other way. > If it's only a year old (the date of the blog post) that might be a bit > fresh for us, and you > can add it to scikit-learn contrib. > > We would also like to see that there are cases when it works better than > what is already established > and what we have, like mutual info based selection. 
> > It looks like WOE is just the coefficient vector of Naive Bayes, right? > I don't quite understand the information value at a glance, though. > > Andy > > > On 10/04/2016 05:39 PM, urvesh patel wrote: > > >> I have been using R extensively until last few months when I started >> using Python. I noticed that Python doesn't have a function to compute >> information value and weight of evidence. Detailed explanation - >> http://multithreaded.stitchfix.com/blog/2015/08/13/weight-of-evidence/ >> >> I have version 0 of this concept ready and I would like to contribute to >> scikit-learn so that more and more people can use it. What are the steps I >> need to follow in order to do so ? >> >> -- >> Thanking You, >> >> Urvesh Patel >> Data Ninja >> Udacity >> > > > > > _______________________________________________ > scikit-learn mailing listscikit-learn at python.orghttps://mail.python.org/mailman/listinfo/scikit-learn > > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -- Thanking You, Urvesh Patel Columbia University *Masters in Operations Research* -------------- next part -------------- An HTML attachment was scrubbed... URL: From themismavridis at gmail.com Thu Oct 6 11:45:41 2016 From: themismavridis at gmail.com (Themis Mavridis) Date: Thu, 6 Oct 2016 17:45:41 +0200 Subject: [scikit-learn] out of core in ARD or Bayessian Ridge Regression Message-ID: I would like to perform out-of-core training using Bayesian Ridge Regression or ARD. Is there any plan to implement such a functionality? Thanks, Themis -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From olivier.grisel at ensta.org Fri Oct 7 03:59:04 2016 From: olivier.grisel at ensta.org (Olivier Grisel) Date: Fri, 7 Oct 2016 09:59:04 +0200 Subject: [scikit-learn] out of core in ARD or Bayessian Ridge Regression In-Reply-To: References: Message-ID: I don't think anybody is working on this but you should better check in github pull requests. Best, -- Olivier From aakash at klugtek.co.in Fri Oct 7 09:51:44 2016 From: aakash at klugtek.co.in (Aakash Agarwal) Date: Fri, 7 Oct 2016 19:21:44 +0530 Subject: [scikit-learn] MLP Classifier error in 0.18 version Message-ID: Hi Guys, I am playing around MLP classifier lately. So i have about 450 inputs to classify. Each input is a vector of array size 50. I am trying to fit the model with 90% as train data. Size of training data: (398, 50) Size of testing data: (45, 50) MLP instantiation: gen_class = MLPClassifier(hidden_layer_sizes=(200,),max_iter=3000,learning_rate='adaptive',alpha=0.025,warm_start=True) Batch size is auto so it is taking 200 as batch_size. 
But when i am fitting the classifier model, i am getting following error: Traceback (most recent call last): File "intent_detection_classifier_selection.py", line 452, in sk_class.gen_class_fitting(gen_class,corp_lsi_train,train_label) File "intent_detection_classifier_selection.py", line 77, in gen_class_fitting gen_class.fit(data,label) File "/usr/local/lib/python2.7/dist-packages/sklearn/neural_network/multilayer_perceptron.py", line 612, in fit return self._fit(X, y, incremental=False) File "/usr/local/lib/python2.7/dist-packages/sklearn/neural_network/multilayer_perceptron.py", line 372, in _fit intercept_grads, layer_units, incremental) File "/usr/local/lib/python2.7/dist-packages/sklearn/neural_network/multilayer_perceptron.py", line 509, in _fit_stochastic coef_grads, intercept_grads) File "/usr/local/lib/python2.7/dist-packages/sklearn/neural_network/multilayer_perceptron.py", line 225, in _backprop loss = LOSS_FUNCTIONS[self.loss](y, activations[-1]) File "/usr/local/lib/python2.7/dist-packages/sklearn/neural_network/_base.py", line 222, in log_loss return -np.sum(y_true * np.log(y_prob)) / y_prob.shape[0] ValueError: operands could not be broadcast together with shapes (200,128) (200,125) Thanks, Aakash -------------- next part -------------- An HTML attachment was scrubbed... URL: From t3kcit at gmail.com Fri Oct 7 11:47:55 2016 From: t3kcit at gmail.com (Andreas Mueller) Date: Fri, 7 Oct 2016 11:47:55 -0400 Subject: [scikit-learn] MLP Classifier error in 0.18 version In-Reply-To: References: Message-ID: <96977a13-28b8-5f18-d716-aa5106fddbfb@gmail.com> Hi. Can you provide a self-contained example to reproduce on the issue-tracker? Maybe you used warm_start=True but changed something about the dataset, like going from 125 classes to 128? 
This works:

import numpy as np
from sklearn.neural_network import MLPClassifier

gen_class = MLPClassifier(hidden_layer_sizes=(200,), max_iter=3000,
                          learning_rate='adaptive', alpha=0.025,
                          warm_start=True)
X_train = np.random.uniform(size=(398, 50))
y_train = np.random.uniform(size=398) > .5
gen_class.fit(X_train, y_train)

best,
Andy

On 10/07/2016 09:51 AM, Aakash Agarwal wrote:
> Hi Guys,
>
> I am playing around MLP classifier lately. So i have about 450 inputs to classify. Each input is a vector of array size 50. I am trying to fit the model with 90% as train data.
>
> Size of training data: (398, 50)
> Size of testing data: (45, 50)
>
> MLP instantiation:
> gen_class = MLPClassifier(hidden_layer_sizes=(200,),max_iter=3000,learning_rate='adaptive',alpha=0.025,warm_start=True)
>
> Batch size is auto so it is taking 200 as batch_size. But when i am fitting the classifier model, i am getting following error:
>
> Traceback (most recent call last):
>   File "intent_detection_classifier_selection.py", line 452, in
>     sk_class.gen_class_fitting(gen_class,corp_lsi_train,train_label)
>   File "intent_detection_classifier_selection.py", line 77, in gen_class_fitting
>     gen_class.fit(data,label)
>   File "/usr/local/lib/python2.7/dist-packages/sklearn/neural_network/multilayer_perceptron.py", line 612, in fit
>     return self._fit(X, y, incremental=False)
>   File "/usr/local/lib/python2.7/dist-packages/sklearn/neural_network/multilayer_perceptron.py", line 372, in _fit
>     intercept_grads, layer_units, incremental)
>   File "/usr/local/lib/python2.7/dist-packages/sklearn/neural_network/multilayer_perceptron.py", line 509, in _fit_stochastic
>     coef_grads, intercept_grads)
>   File "/usr/local/lib/python2.7/dist-packages/sklearn/neural_network/multilayer_perceptron.py", line 225, in _backprop
>     loss = LOSS_FUNCTIONS[self.loss](y, activations[-1])
>   File "/usr/local/lib/python2.7/dist-packages/sklearn/neural_network/_base.py", line 222, in log_loss
>     return -np.sum(y_true * np.log(y_prob))
/ y_prob.shape[0] > ValueError: operands could not be broadcast together with shapes > (200,128) (200,125) > > Thanks, > Aakash > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn -------------- next part -------------- An HTML attachment was scrubbed... URL: From t3kcit at gmail.com Fri Oct 7 11:48:53 2016 From: t3kcit at gmail.com (Andreas Mueller) Date: Fri, 7 Oct 2016 11:48:53 -0400 Subject: [scikit-learn] out of core in ARD or Bayessian Ridge Regression In-Reply-To: References: Message-ID: <660bc83c-5a0f-48ed-9932-19b35fa4fd17@gmail.com> I don't think anyone is working on this. I'm not sure what optimizer is best for this. Maybe EP would be interesting. On 10/07/2016 03:59 AM, Olivier Grisel wrote: > I don't think anybody is working on this but you should better check > in github pull requests. > > Best, > From aakash at klugtek.co.in Fri Oct 7 14:52:31 2016 From: aakash at klugtek.co.in (Aakash Agarwal) Date: Sat, 8 Oct 2016 00:22:31 +0530 Subject: [scikit-learn] MLP Classifier error in 0.18 version In-Reply-To: <96977a13-28b8-5f18-d716-aa5106fddbfb@gmail.com> References: <96977a13-28b8-5f18-d716-aa5106fddbfb@gmail.com> Message-ID: Hi Andy, Thanks for the quick reply. Basically i am randomly choosing 90% training data from the data set and fitting the classifier again and again. First few transactions are working fine but after that it is failing in between. So like you mentioned, standalone fitting is happening. But as you said, warm_start seems to be the issue. Since i was choosing data randomly, total number of labels in a single batch was not constant over multiple iterations and it could not detect new labels from the previous model and thus failed. Thanks a lot for the valuable inputs. Aakash On Fri, Oct 7, 2016 at 9:17 PM, Andreas Mueller wrote: > Hi. > Can you provide a self-contained example to reproduce on the issue-tracker? 
> Maybe you used warm_start=True but changed something about the dataset, > like going from 125 classes to 128? > > This works: > > from sklearn.neural_network import MLPClassifier > gen_class = MLPClassifier(hidden_layer_sizes=(200,),max_iter=3000, > learning_rate='adaptive',alpha=0.025,warm_start=True) > X_train = np.random.uniform(size=(398, 50)) > y_train = np.random.uniform(size=398) > .5 > gen_class.fit(X_train, y_train) > > best, > Andy > > > On 10/07/2016 09:51 AM, Aakash Agarwal wrote: > > Hi Guys, > > I am playing around MLP classifier lately. So i have about 450 inputs to > classify. Each input is a vector of array size 50. I am trying to fit the > model with 90% as train data. > > Size of training data: (398, 50) > Size of testing data: (45, 50) > > MLP instantiation: > gen_class = MLPClassifier(hidden_layer_sizes=(200,),max_iter=3000, > learning_rate='adaptive',alpha=0.025,warm_start=True) > > Batch size is auto so it is taking 200 as batch_size. But when i am > fitting the classifier model, i am getting following error: > > Traceback (most recent call last): > File "intent_detection_classifier_selection.py", line 452, in > sk_class.gen_class_fitting(gen_class,corp_lsi_train,train_label) > File "intent_detection_classifier_selection.py", line 77, in > gen_class_fitting > gen_class.fit(data,label) > File "/usr/local/lib/python2.7/dist-packages/sklearn/neural_ > network/multilayer_perceptron.py", line 612, in fit > return self._fit(X, y, incremental=False) > File "/usr/local/lib/python2.7/dist-packages/sklearn/neural_ > network/multilayer_perceptron.py", line 372, in _fit > intercept_grads, layer_units, incremental) > File "/usr/local/lib/python2.7/dist-packages/sklearn/neural_ > network/multilayer_perceptron.py", line 509, in _fit_stochastic > coef_grads, intercept_grads) > File "/usr/local/lib/python2.7/dist-packages/sklearn/neural_ > network/multilayer_perceptron.py", line 225, in _backprop > loss = LOSS_FUNCTIONS[self.loss](y, activations[-1]) > File 
"/usr/local/lib/python2.7/dist-packages/sklearn/neural_network/_base.py", > line 222, in log_loss > return -np.sum(y_true * np.log(y_prob)) / y_prob.shape[0] > ValueError: operands could not be broadcast together with shapes (200,128) > (200,125) > > Thanks, > Aakash > > > _______________________________________________ > scikit-learn mailing listscikit-learn at python.orghttps://mail.python.org/mailman/listinfo/scikit-learn > > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -- Thanks, Aakash -------------- next part -------------- An HTML attachment was scrubbed... URL: From drraph at gmail.com Mon Oct 10 06:55:52 2016 From: drraph at gmail.com (Raphael C) Date: Mon, 10 Oct 2016 11:55:52 +0100 Subject: [scikit-learn] Using logistic regression with count proportions data Message-ID: I am trying to perform regression where my dependent variable is constrained to be between 0 and 1. This constraint comes from the fact that it represents a count proportion. That is counts in some category divided by a total count. In the literature it seems that one common way to tackle this is to use logistic regression. However, it appears that in scikit learn logistic regression is only available as a classifier (http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html ) . Is that right? Is there another way to perform regression using scikit learn where the dependent variable is a count proportion? Thanks for any help. Raphael From drraph at gmail.com Mon Oct 10 07:03:28 2016 From: drraph at gmail.com (Raphael C) Date: Mon, 10 Oct 2016 12:03:28 +0100 Subject: [scikit-learn] Using logistic regression with count proportions data In-Reply-To: References: Message-ID: I just noticed this about the glm package in R. 
http://stats.stackexchange.com/a/26779/53128

"
The glm function in R allows 3 ways to specify the formula for a
logistic regression model.

The most common is that each row of the data frame represents a single
observation and the response variable is either 0 or 1 (or a factor
with 2 levels, or other variable with only 2 unique values).

Another option is to use a 2 column matrix as the response variable
with the first column being the counts of 'successes' and the second
column being the counts of 'failures'.

You can also specify the response as a proportion between 0 and 1,
then specify another column as the 'weight' that gives the total
number that the proportion is from (so a response of 0.3 and a weight
of 10 is the same as 3 'successes' and 7 'failures')."

Either of the last two options would do for me. Does scikit-learn
support either of these last two options?

Raphael

On 10 October 2016 at 11:55, Raphael C wrote:
> I am trying to perform regression where my dependent variable is
> constrained to be between 0 and 1. This constraint comes from the fact
> that it represents a count proportion. That is counts in some category
> divided by a total count.
>
> In the literature it seems that one common way to tackle this is to
> use logistic regression. However, it appears that in scikit learn
> logistic regression is only available as a classifier
> (http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
> ) . Is that right?
>
> Is there another way to perform regression using scikit learn where
> the dependent variable is a count proportion?
>
> Thanks for any help.
> > Raphael From sean.violante at gmail.com Mon Oct 10 07:08:28 2016 From: sean.violante at gmail.com (Sean Violante) Date: Mon, 10 Oct 2016 13:08:28 +0200 Subject: [scikit-learn] Using logistic regression with count proportions data In-Reply-To: References: Message-ID: should be the sample weight function in fit http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html On Mon, Oct 10, 2016 at 1:03 PM, Raphael C wrote: > I just noticed this about the glm package in R. > http://stats.stackexchange.com/a/26779/53128 > > " > The glm function in R allows 3 ways to specify the formula for a > logistic regression model. > > The most common is that each row of the data frame represents a single > observation and the response variable is either 0 or 1 (or a factor > with 2 levels, or other varibale with only 2 unique values). > > Another option is to use a 2 column matrix as the response variable > with the first column being the counts of 'successes' and the second > column being the counts of 'failures'. > > You can also specify the response as a proportion between 0 and 1, > then specify another column as the 'weight' that gives the total > number that the proportion is from (so a response of 0.3 and a weight > of 10 is the same as 3 'successes' and 7 'failures')." > > Either of the last two options would do for me. Does scikit-learn > support either of these last two options? > > Raphael > > On 10 October 2016 at 11:55, Raphael C wrote: > > I am trying to perform regression where my dependent variable is > > constrained to be between 0 and 1. This constraint comes from the fact > > that it represents a count proportion. That is counts in some category > > divided by a total count. > > > > In the literature it seems that one common way to tackle this is to > > use logistic regression. 
However, it appears that in scikit learn > > logistic regression is only available as a classifier > > (http://scikit-learn.org/stable/modules/generated/sklearn.linear_model. > LogisticRegression.html > > ) . Is that right? > > > > Is there another way to perform regression using scikit learn where > > the dependent variable is a count proportion? > > > > Thanks for any help. > > > > Raphael > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From drraph at gmail.com Mon Oct 10 07:15:17 2016 From: drraph at gmail.com (Raphael C) Date: Mon, 10 Oct 2016 12:15:17 +0100 Subject: [scikit-learn] Using logistic regression with count proportions data In-Reply-To: References: Message-ID: How do I use sample_weight for my use case? In my case is "y" an array of 0s and 1s and sample_weight then an array real numbers between 0 and 1 where I should make sure to set sample_weight[i]= 0 when y[i] = 0? Raphael On 10 October 2016 at 12:08, Sean Violante wrote: > should be the sample weight function in fit > > http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html > > On Mon, Oct 10, 2016 at 1:03 PM, Raphael C wrote: >> >> I just noticed this about the glm package in R. >> http://stats.stackexchange.com/a/26779/53128 >> >> " >> The glm function in R allows 3 ways to specify the formula for a >> logistic regression model. >> >> The most common is that each row of the data frame represents a single >> observation and the response variable is either 0 or 1 (or a factor >> with 2 levels, or other varibale with only 2 unique values). >> >> Another option is to use a 2 column matrix as the response variable >> with the first column being the counts of 'successes' and the second >> column being the counts of 'failures'. 
>> >> You can also specify the response as a proportion between 0 and 1, >> then specify another column as the 'weight' that gives the total >> number that the proportion is from (so a response of 0.3 and a weight >> of 10 is the same as 3 'successes' and 7 'failures')." >> >> Either of the last two options would do for me. Does scikit-learn >> support either of these last two options? >> >> Raphael >> >> On 10 October 2016 at 11:55, Raphael C wrote: >> > I am trying to perform regression where my dependent variable is >> > constrained to be between 0 and 1. This constraint comes from the fact >> > that it represents a count proportion. That is counts in some category >> > divided by a total count. >> > >> > In the literature it seems that one common way to tackle this is to >> > use logistic regression. However, it appears that in scikit learn >> > logistic regression is only available as a classifier >> > >> > (http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html >> > ) . Is that right? >> > >> > Is there another way to perform regression using scikit learn where >> > the dependent variable is a count proportion? >> > >> > Thanks for any help. >> > >> > Raphael >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn > > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > From sean.violante at gmail.com Mon Oct 10 07:22:11 2016 From: sean.violante at gmail.com (Sean Violante) Date: Mon, 10 Oct 2016 13:22:11 +0200 Subject: [scikit-learn] Using logistic regression with count proportions data In-Reply-To: References: Message-ID: no ( but please check !) 
sample weights should be the counts for the respective label (0/1) [ I am actually puzzled about the glm help file - proportions loses how often an input data 'row' was present relative to the other - though you could do this by repeating the row 'n' times] On Mon, Oct 10, 2016 at 1:15 PM, Raphael C wrote: > How do I use sample_weight for my use case? > > In my case is "y" an array of 0s and 1s and sample_weight then an > array real numbers between 0 and 1 where I should make sure to set > sample_weight[i]= 0 when y[i] = 0? > > Raphael > > On 10 October 2016 at 12:08, Sean Violante > wrote: > > should be the sample weight function in fit > > > > http://scikit-learn.org/stable/modules/generated/sklearn.linear_model. > LogisticRegression.html > > > > On Mon, Oct 10, 2016 at 1:03 PM, Raphael C wrote: > >> > >> I just noticed this about the glm package in R. > >> http://stats.stackexchange.com/a/26779/53128 > >> > >> " > >> The glm function in R allows 3 ways to specify the formula for a > >> logistic regression model. > >> > >> The most common is that each row of the data frame represents a single > >> observation and the response variable is either 0 or 1 (or a factor > >> with 2 levels, or other varibale with only 2 unique values). > >> > >> Another option is to use a 2 column matrix as the response variable > >> with the first column being the counts of 'successes' and the second > >> column being the counts of 'failures'. > >> > >> You can also specify the response as a proportion between 0 and 1, > >> then specify another column as the 'weight' that gives the total > >> number that the proportion is from (so a response of 0.3 and a weight > >> of 10 is the same as 3 'successes' and 7 'failures')." > >> > >> Either of the last two options would do for me. Does scikit-learn > >> support either of these last two options? 
> >> > >> Raphael > >> > >> On 10 October 2016 at 11:55, Raphael C wrote: > >> > I am trying to perform regression where my dependent variable is > >> > constrained to be between 0 and 1. This constraint comes from the fact > >> > that it represents a count proportion. That is counts in some category > >> > divided by a total count. > >> > > >> > In the literature it seems that one common way to tackle this is to > >> > use logistic regression. However, it appears that in scikit learn > >> > logistic regression is only available as a classifier > >> > > >> > (http://scikit-learn.org/stable/modules/generated/ > sklearn.linear_model.LogisticRegression.html > >> > ) . Is that right? > >> > > >> > Is there another way to perform regression using scikit learn where > >> > the dependent variable is a count proportion? > >> > > >> > Thanks for any help. > >> > > >> > Raphael > >> _______________________________________________ > >> scikit-learn mailing list > >> scikit-learn at python.org > >> https://mail.python.org/mailman/listinfo/scikit-learn > > > > > > > > _______________________________________________ > > scikit-learn mailing list > > scikit-learn at python.org > > https://mail.python.org/mailman/listinfo/scikit-learn > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From drraph at gmail.com Mon Oct 10 09:48:32 2016 From: drraph at gmail.com (Raphael C) Date: Mon, 10 Oct 2016 14:48:32 +0100 Subject: [scikit-learn] Using logistic regression with count proportions data In-Reply-To: References: Message-ID: On 10 October 2016 at 12:22, Sean Violante wrote: > no ( but please check !) 
> > sample weights should be the counts for the respective label (0/1) > > [ I am actually puzzled about the glm help file - proportions loses how > often an input data 'row' was present relative to the other - though you > could do this by repeating the row 'n' times] I think we might be talking at cross purposes. I have a matrix X where each row is a feature vector. I also have an array y where y[i] is a real number between 0 and 1. I would like to build a regression model that predicts the y values given the X rows. Now each y[i] value in fact comes from simply counting the number of positive labelled elements in a particular set (set i) and dividing by the number of elements in that set. So I can easily fit this into the model given by the R package glm by replacing each y[i] value by a pair of "Number of positives" and "Number of negatives" (this is case 2 in the docs I quoted) or using case 3 which asks for the y[i] plus the total number of elements in set i. I don't see how a single integer for sample_weight[i] would cover this information but I am sure I must have misunderstood. At best it seems it could cover the number of positive values but this is missing half the information. Raphael > > On Mon, Oct 10, 2016 at 1:15 PM, Raphael C wrote: >> >> How do I use sample_weight for my use case? >> >> In my case is "y" an array of 0s and 1s and sample_weight then an >> array real numbers between 0 and 1 where I should make sure to set >> sample_weight[i]= 0 when y[i] = 0? >> >> Raphael >> >> On 10 October 2016 at 12:08, Sean Violante >> wrote: >> > should be the sample weight function in fit >> > >> > >> > http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html >> > >> > On Mon, Oct 10, 2016 at 1:03 PM, Raphael C wrote: >> >> >> >> I just noticed this about the glm package in R. 
>> >> http://stats.stackexchange.com/a/26779/53128 >> >> >> >> " >> >> The glm function in R allows 3 ways to specify the formula for a >> >> logistic regression model. >> >> >> >> The most common is that each row of the data frame represents a single >> >> observation and the response variable is either 0 or 1 (or a factor >> >> with 2 levels, or other varibale with only 2 unique values). >> >> >> >> Another option is to use a 2 column matrix as the response variable >> >> with the first column being the counts of 'successes' and the second >> >> column being the counts of 'failures'. >> >> >> >> You can also specify the response as a proportion between 0 and 1, >> >> then specify another column as the 'weight' that gives the total >> >> number that the proportion is from (so a response of 0.3 and a weight >> >> of 10 is the same as 3 'successes' and 7 'failures')." >> >> >> >> Either of the last two options would do for me. Does scikit-learn >> >> support either of these last two options? >> >> >> >> Raphael >> >> >> >> On 10 October 2016 at 11:55, Raphael C wrote: >> >> > I am trying to perform regression where my dependent variable is >> >> > constrained to be between 0 and 1. This constraint comes from the >> >> > fact >> >> > that it represents a count proportion. That is counts in some >> >> > category >> >> > divided by a total count. >> >> > >> >> > In the literature it seems that one common way to tackle this is to >> >> > use logistic regression. However, it appears that in scikit learn >> >> > logistic regression is only available as a classifier >> >> > >> >> > >> >> > (http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html >> >> > ) . Is that right? >> >> > >> >> > Is there another way to perform regression using scikit learn where >> >> > the dependent variable is a count proportion? >> >> > >> >> > Thanks for any help. 
>> >> > >> >> > Raphael >> >> _______________________________________________ >> >> scikit-learn mailing list >> >> scikit-learn at python.org >> >> https://mail.python.org/mailman/listinfo/scikit-learn >> > >> > >> > >> > _______________________________________________ >> > scikit-learn mailing list >> > scikit-learn at python.org >> > https://mail.python.org/mailman/listinfo/scikit-learn >> > >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn > > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > From sean.violante at gmail.com Mon Oct 10 10:04:45 2016 From: sean.violante at gmail.com (Sean Violante) Date: Mon, 10 Oct 2016 16:04:45 +0200 Subject: [scikit-learn] Using logistic regression with count proportions data In-Reply-To: References: Message-ID: sorry yes there was a misunderstanding: I meant for each feature configuration you should pass in two rows (one for the positive cases and one for the negative) and the sample weight being the corresponding count for that configuration and class and I am saying that the total count is important because you could have a situation where one feature combination occurs 10 times and another feature combination 1000 times On Mon, Oct 10, 2016 at 3:48 PM, Raphael C wrote: > On 10 October 2016 at 12:22, Sean Violante > wrote: > > no ( but please check !) > > > > sample weights should be the counts for the respective label (0/1) > > > > [ I am actually puzzled about the glm help file - proportions loses how > > often an input data 'row' was present relative to the other - though you > > could do this by repeating the row 'n' times] > > I think we might be talking at cross purposes. > > I have a matrix X where each row is a feature vector. 
I also have an > array y where y[i] is a real number between 0 and 1. I would like to > build a regression model that predicts the y values given the X rows. > > Now each y[i] value in fact comes from simply counting the number of > positive labelled elements in a particular set (set i) and dividing by > the number of elements in that set. So I can easily fit this into the > model given by the R package glm by replacing each y[i] value by a > pair of "Number of positives" and "Number of negatives" (this is case > 2 in the docs I quoted) or using case 3 which asks for the y[i] plus > the total number of elements in set i. > > I don't see how a single integer for sample_weight[i] would cover this > information but I am sure I must have misunderstood. At best it seems > it could cover the number of positive values but this is missing half > the information. > > Raphael > > > > > On Mon, Oct 10, 2016 at 1:15 PM, Raphael C wrote: > >> > >> How do I use sample_weight for my use case? > >> > >> In my case is "y" an array of 0s and 1s and sample_weight then an > >> array real numbers between 0 and 1 where I should make sure to set > >> sample_weight[i]= 0 when y[i] = 0? > >> > >> Raphael > >> > >> On 10 October 2016 at 12:08, Sean Violante > >> wrote: > >> > should be the sample weight function in fit > >> > > >> > > >> > http://scikit-learn.org/stable/modules/generated/ > sklearn.linear_model.LogisticRegression.html > >> > > >> > On Mon, Oct 10, 2016 at 1:03 PM, Raphael C wrote: > >> >> > >> >> I just noticed this about the glm package in R. > >> >> http://stats.stackexchange.com/a/26779/53128 > >> >> > >> >> " > >> >> The glm function in R allows 3 ways to specify the formula for a > >> >> logistic regression model. > >> >> > >> >> The most common is that each row of the data frame represents a > single > >> >> observation and the response variable is either 0 or 1 (or a factor > >> >> with 2 levels, or other varibale with only 2 unique values). 
> >> >> > >> >> Another option is to use a 2 column matrix as the response variable > >> >> with the first column being the counts of 'successes' and the second > >> >> column being the counts of 'failures'. > >> >> > >> >> You can also specify the response as a proportion between 0 and 1, > >> >> then specify another column as the 'weight' that gives the total > >> >> number that the proportion is from (so a response of 0.3 and a weight > >> >> of 10 is the same as 3 'successes' and 7 'failures')." > >> >> > >> >> Either of the last two options would do for me. Does scikit-learn > >> >> support either of these last two options? > >> >> > >> >> Raphael > >> >> > >> >> On 10 October 2016 at 11:55, Raphael C wrote: > >> >> > I am trying to perform regression where my dependent variable is > >> >> > constrained to be between 0 and 1. This constraint comes from the > >> >> > fact > >> >> > that it represents a count proportion. That is counts in some > >> >> > category > >> >> > divided by a total count. > >> >> > > >> >> > In the literature it seems that one common way to tackle this is to > >> >> > use logistic regression. However, it appears that in scikit learn > >> >> > logistic regression is only available as a classifier > >> >> > > >> >> > > >> >> > (http://scikit-learn.org/stable/modules/generated/ > sklearn.linear_model.LogisticRegression.html > >> >> > ) . Is that right? > >> >> > > >> >> > Is there another way to perform regression using scikit learn where > >> >> > the dependent variable is a count proportion? > >> >> > > >> >> > Thanks for any help. 
-------------- next part --------------
An HTML attachment was scrubbed...
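The two-rows-per-configuration idea described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not an official scikit-learn recipe: the helper names `expand_counts` and `fit_proportion_model` and the toy numbers are made up here, and the only scikit-learn call relied on is `LogisticRegression.fit(X, y, sample_weight=...)`. Note also that `LogisticRegression` is L2-regularised by default, so the coefficients will generally not match an unpenalised R `glm` fit exactly.

```python
def expand_counts(X, positives, totals):
    """Turn one (features, positives, total) triple per set into weighted 0/1 rows.

    Each feature vector is emitted up to twice: once with label 1, weighted by
    the number of positives, and once with label 0, weighted by the number of
    negatives. Rows whose weight would be zero are skipped.
    """
    rows, labels, weights = [], [], []
    for x, pos, total in zip(X, positives, totals):
        neg = total - pos
        if pos < 0 or neg < 0:
            raise ValueError("need 0 <= positives <= total")
        for label, count in ((1, pos), (0, neg)):
            if count > 0:
                rows.append(list(x))
                labels.append(label)
                weights.append(count)
    return rows, labels, weights


def fit_proportion_model(X, positives, totals):
    """Fit LogisticRegression on the expanded rows (requires scikit-learn)."""
    # Imported lazily so the helper above stays usable without scikit-learn.
    from sklearn.linear_model import LogisticRegression

    rows, labels, weights = expand_counts(X, positives, totals)
    model = LogisticRegression()
    model.fit(rows, labels, sample_weight=weights)
    # model.predict_proba(X_new)[:, 1] then estimates the proportion.
    return model


# Toy data (made up): 3 of 10 positive in set 0, 900 of 1000 in set 1.
X = [[0.0, 1.0], [1.0, 0.0]]
rows, labels, weights = expand_counts(X, positives=[3, 900], totals=[10, 1000])
print(labels, weights)
```

Because the weights are raw counts, the set with 1000 elements influences the fit far more than the set with 10, which is the point made above about keeping the total count rather than only the proportion.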
URL: From michael.eickenberg at gmail.com Mon Oct 10 10:46:17 2016 From: michael.eickenberg at gmail.com (Michael Eickenberg) Date: Mon, 10 Oct 2016 16:46:17 +0200 Subject: [scikit-learn] Using logistic regression with count proportions data In-Reply-To: References: Message-ID: Here is a possibly useful comment of larsmans on stackoverflow about exactly this procedure http://stackoverflow.com/questions/26604175/how-to-predict-a-continuous-dependent-variable-that-expresses-target-class-proba/26614131#comment41846816_26614131 On Mon, Oct 10, 2016 at 4:04 PM, Sean Violante wrote: > sorry yes there was a misunderstanding: > > I meant for each feature configuration you should pass in two rows (one > for the positive cases and one for the negative) > and the sample weight being the corresponding count for that configuration > and class > > and I am saying that the total count is important because you could have > a situation where > one feature combination occurs 10 times and another feature combination > 1000 times > > > > > > On Mon, Oct 10, 2016 at 3:48 PM, Raphael C wrote: > >> On 10 October 2016 at 12:22, Sean Violante >> wrote: >> > no ( but please check !) >> > >> > sample weights should be the counts for the respective label (0/1) >> > >> > [ I am actually puzzled about the glm help file - proportions loses how >> > often an input data 'row' was present relative to the other - though you >> > could do this by repeating the row 'n' times] >> >> I think we might be talking at cross purposes. >> >> I have a matrix X where each row is a feature vector. I also have an >> array y where y[i] is a real number between 0 and 1. I would like to >> build a regression model that predicts the y values given the X rows. >> >> Now each y[i] value in fact comes from simply counting the number of >> positive labelled elements in a particular set (set i) and dividing by >> the number of elements in that set. 
So I can easily fit this into the >> model given by the R package glm by replacing each y[i] value by a >> pair of "Number of positives" and "Number of negatives" (this is case >> 2 in the docs I quoted) or using case 3 which asks for the y[i] plus >> the total number of elements in set i. >> >> I don't see how a single integer for sample_weight[i] would cover this >> information but I am sure I must have misunderstood. At best it seems >> it could cover the number of positive values but this is missing half >> the information. >> >> Raphael >> >> > >> > On Mon, Oct 10, 2016 at 1:15 PM, Raphael C wrote: >> >> >> >> How do I use sample_weight for my use case? >> >> >> >> In my case is "y" an array of 0s and 1s and sample_weight then an >> >> array real numbers between 0 and 1 where I should make sure to set >> >> sample_weight[i]= 0 when y[i] = 0? >> >> >> >> Raphael >> >> >> >> On 10 October 2016 at 12:08, Sean Violante >> >> wrote: >> >> > should be the sample weight function in fit >> >> > >> >> > >> >> > http://scikit-learn.org/stable/modules/generated/sklearn. >> linear_model.LogisticRegression.html >> >> > >> >> > On Mon, Oct 10, 2016 at 1:03 PM, Raphael C wrote: >> >> >> >> >> >> I just noticed this about the glm package in R. >> >> >> http://stats.stackexchange.com/a/26779/53128 >> >> >> >> >> >> " >> >> >> The glm function in R allows 3 ways to specify the formula for a >> >> >> logistic regression model. >> >> >> >> >> >> The most common is that each row of the data frame represents a >> single >> >> >> observation and the response variable is either 0 or 1 (or a factor >> >> >> with 2 levels, or other varibale with only 2 unique values). >> >> >> >> >> >> Another option is to use a 2 column matrix as the response variable >> >> >> with the first column being the counts of 'successes' and the second >> >> >> column being the counts of 'failures'. 
>> >> >> >> >> >> You can also specify the response as a proportion between 0 and 1, >> >> >> then specify another column as the 'weight' that gives the total >> >> >> number that the proportion is from (so a response of 0.3 and a >> weight >> >> >> of 10 is the same as 3 'successes' and 7 'failures')." >> >> >> >> >> >> Either of the last two options would do for me. Does scikit-learn >> >> >> support either of these last two options? >> >> >> >> >> >> Raphael >> >> >> >> >> >> On 10 October 2016 at 11:55, Raphael C wrote: >> >> >> > I am trying to perform regression where my dependent variable is >> >> >> > constrained to be between 0 and 1. This constraint comes from the >> >> >> > fact >> >> >> > that it represents a count proportion. That is counts in some >> >> >> > category >> >> >> > divided by a total count. >> >> >> > >> >> >> > In the literature it seems that one common way to tackle this is >> to >> >> >> > use logistic regression. However, it appears that in scikit learn >> >> >> > logistic regression is only available as a classifier >> >> >> > >> >> >> > >> >> >> > (http://scikit-learn.org/stable/modules/generated/sklearn. >> linear_model.LogisticRegression.html >> >> >> > ) . Is that right? >> >> >> > >> >> >> > Is there another way to perform regression using scikit learn >> where >> >> >> > the dependent variable is a count proportion? >> >> >> > >> >> >> > Thanks for any help. 
>> >> >> > Raphael

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From siddharthgupta234 at gmail.com Tue Oct 11 01:18:57 2016
From: siddharthgupta234 at gmail.com (Siddharth Gupta)
Date: Tue, 11 Oct 2016 10:48:57 +0530
Subject: [scikit-learn] Doubt regarding issue timeline
Message-ID:

Hello fellas,

I have a doubt. Suppose I volunteer to work on an issue, but due to some unavoidable circumstance I am unable to work on it for some time. When should I let the community know? I guess it depends on the issue/bug, but on average, how much time should one take to resolve an issue?

Regards,
Siddharth Gupta,
Ph: 9871012292
Linkedin | Github | Codechef | Twitter | Facebook

-------------- next part --------------
An HTML attachment was scrubbed...

From jaquesgrobler at gmail.com Tue Oct 11 01:49:30 2016
From: jaquesgrobler at gmail.com (Jaques Grobler)
Date: Tue, 11 Oct 2016 07:49:30 +0200
Subject: [scikit-learn] Doubt regarding issue timeline

I'd say a 'standup'-ish approach could work here: every day or three, if
you find yourself getting pulled off the issue by other work, life, etc.,
take a moment at a set time to post, if needed, on the progress or blocking
factors, even if it's just 'can't work on this today'. Yes, this could
potentially get spammy, but it gives nice transparency, and if it's urgent
to finish the issue soon, like before a release, the community can know
whether or not it needs to be handed over, or whether you believe you'll
still have time.

This doesn't have to be a rule, more of a guideline: the community will
always have a fairly recent status update, even if the person can't touch
the issue for weeks.

Just my thoughts on it :)

From j.vanschoren at tue.nl Tue Oct 11 05:11:00 2016
From: j.vanschoren at tue.nl (Joaquin Vanschoren)
Date: Tue, 11 Oct 2016 09:11:00 +0000
Subject: [scikit-learn] Welcome Raghav to the core-dev team
References: <20161003151415.GF20745@phare.normalesup.org> <3d961045-39d0-8c81-1deb-2f6b7332ff1e@gmail.com>

A bit late, but heartfelt congrats to Raghav :)

On Tue, Oct 4, 2016 at 12:43 PM Joel Nothman wrote:
> Congratulations, Raghav! Thanks for your dedication, as a student and
> mentor in GSoC, but at all other times too!
>
> On 4 October 2016 at 19:14, Jaques Grobler wrote:
>
> Congrats Raghav!
>
> 2016-10-03 21:25 GMT+02:00 Andreas Mueller:
>
> Congrats, hope to see lots more ;)
>
> On 10/03/2016 12:09 PM, Raghav R V wrote:
>
> Thanks everyone! Looking forward to contributing more :D
>
> On Mon, Oct 3, 2016 at 5:40 PM, Ronnie Ghose wrote:
>
> congrats! :)
>
> On Mon, Oct 3, 2016 at 11:28 AM, lin yenchen wrote:
>
> Congrats, Raghav!
>
> Nelson Liu wrote on 2016-10-03 at 11:27:
>
> Yay! Congrats, Raghav!
>
> On Mon, Oct 3, 2016 at 8:14 AM, Gael Varoquaux
> <gael.varoquaux at normalesup.org> wrote:
>
> Hi,
>
> We have the pleasure to welcome Raghav RV to the core-dev team. Raghav
> (@raghavrv) has been working on scikit-learn for more than a year. In
> particular, he implemented the rewrite of the cross-validation utilities,
> which is quite dear to my heart.
>
> Welcome Raghav!
>
> Gaël

From gabit7 at gmail.com Tue Oct 11 07:29:20 2016
From: gabit7 at gmail.com (Gabriel Trautmann)
Date: Tue, 11 Oct 2016 14:29:20 +0300
Subject: [scikit-learn] HashingVectorizer slow in version 0.18

Hi,

After upgrading to scikit-learn 0.18, HashingVectorizer is about 10 times
slower.

Before:

scikit-learn 0.17. Numpy 1.11.2. Python 3.5.2 AMD64
Vectorizing 20newsgroup 11314 documents
Vectorization completed in 4.594092130661011 seconds, resulting shape (11314, 1048576)

After upgrade:

scikit-learn 0.18. Numpy 1.11.2. Python 3.5.2 AMD64
Vectorizing 20newsgroup 11314 documents
Vectorization completed in 43.587692737579346 seconds, resulting shape (11314, 1048576)

Code:

import platform
import time

import numpy
import sklearn
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import HashingVectorizer

# Load the training split of the 20 newsgroups corpus.
data_train = fetch_20newsgroups(subset='train', shuffle=True, random_state=42)
# Report the versions in use so that runs can be compared.
print('scikit-learn {}. Numpy {}. Python {} {}'.format(
    sklearn.__version__, numpy.version.full_version,
    platform.python_version(), platform.machine()))
vectorizer = HashingVectorizer()
print("Vectorizing 20newsgroup {} documents".format(len(data_train.data)))
# Time only the vectorization step.
start = time.time()
data = vectorizer.fit_transform(data_train.data)
print("Vectorization completed in ", time.time() - start,
      ' seconds, resulting shape ', data.shape)

Should I submit a bug report?

Thank you,
Gabriel Trautmann

From olivier.grisel at ensta.org Tue Oct 11 08:02:47 2016
From: olivier.grisel at ensta.org (Olivier Grisel)
Date: Tue, 11 Oct 2016 14:02:47 +0200
Subject: [scikit-learn] HashingVectorizer slow in version 0.18

I cannot reproduce such a degradation on my machine:

(sklearn-0.17)ogrisel at is146148:~/code/scikit-learn$ python ~/tmp/bench_vectorizer.py
scikit-learn 0.17.1. Numpy 1.11.2. Python 3.5.0 x86_64
Vectorizing 20newsgroup 11314 documents
Vectorization completed in 4.033604383468628 seconds, resulting shape (11314, 1048576)

(sklearn-0.18) ogrisel at is146148:~/code/scikit-learn$ python ~/tmp/bench_vectorizer.py
scikit-learn 0.18. Numpy 1.11.2. Python 3.5.0 x86_64
Vectorizing 20newsgroup 11314 documents
Vectorization completed in 4.990509510040283 seconds, resulting shape (11314, 1048576)

Which operating system are you using?

Please feel free to open an issue on the tracker anyway.
--
Olivier

From gabit7 at gmail.com Tue Oct 11 08:19:24 2016
From: gabit7 at gmail.com (Gabriel Trautmann)
Date: Tue, 11 Oct 2016 15:19:24 +0300
Subject: [scikit-learn] HashingVectorizer slow in version 0.18

Thank you for your response. I have Windows 7 Enterprise 64 bit with an
Intel Xeon E5 2640 CPU, and I see the same problem on two similar machines.

python-3.5.2-amd64.exe - fresh installation
numpy-1.11.2+mkl-cp35-cp35m-win_amd64.whl - from Christoph Gohlke
scipy-0.18.1-cp35-cp35m-win_amd64.whl
pip install scikit-learn

On the same Python instance, if I downgrade to version 0.17, it is much
faster.

pip uninstall scikit-learn
pip install scikit-learn==0.17

I will open an issue after I test on more machines or if someone else can
reproduce the problem.

From piotr.bialecki at hotmail.de Tue Oct 11 08:32:54 2016
From: piotr.bialecki at hotmail.de (Piotr Bialecki)
Date: Tue, 11 Oct 2016 12:32:54 +0000
Subject: [scikit-learn] ANN Scikit-learn 0.18 released
References: <40a62931-bf5a-7c59-7253-69418d53f196@gmail.com> <20160929052856.GA1123098@phare.normalesup.org> <77756fd4-5ad9-e51d-51e1-aa60274d2117@gmail.com>

Congratulations to all contributors!

I would like to update to the new version using conda, but apparently it is
not available:

~$ conda update scikit-learn
Fetching package metadata .......
Solving package specifications: ..........

# All requested packages already installed.
# packages in environment at /home/pbialecki/anaconda2:
#
scikit-learn 0.17.1 np110py27_2

Should I reinstall scikit-learn?

Best regards,
Piotr

On 03.10.2016 18:23, Raghav R V wrote:

Hi Brown,

Thanks for the email. There is a working PR here at

https://github.com/scikit-learn/scikit-learn/pull/7388

Would you be kind enough to take a look at it and comment on how helpful
the proposed API is for your use case?

Thanks

On Mon, Oct 3, 2016 at 6:05 AM, Brown J.B. wrote:

Hello community,

Congratulations on the release of 0.18!
While I'm merely a casual user and wish I could contribute more often, I
thank everyone for their time and efforts!

2016-10-01 1:58 GMT+09:00 Andreas Mueller <t3kcit at gmail.com>:

We've got a lot in the works already for 0.19.

* multiple metrics for cross validation (#7388 et al.)

I've done something like this in my internal model building and selection
libraries. My solution has been to have
- each metric object be able to explain a "distance from optimal"
- a metric collection object, which can be built by either explicit
  instantiation or calculation using data
- a pareto curve calculation object
- a ranker for the points on the pareto curve, with the ability to select
  the N-best points.

While there are certainly smarter interfaces and implementations, here is
an example of one of my doctests that may help get this PR started.
My apologies that my old docstring argument notation doesn't match the
commonly used standards.

Hope this helps,
J.B. Brown
Kyoto University

class TrialRanker(object):
    """An object for handling the generic mechanism of selecting optimal
    trials from a collection of trials."""

    def SelectBest(self, metricSets, paretoAlg,
                   preProcessor=None):
        """Select the best [metricSets] by using the
        [paretoAlg] pareto selection object. Note that it is actually
        the [paretoAlg] that specifies how many optimal [metricSets] to
        select.

        Data may be pre-processed into a form necessary for the [paretoAlg]
        by using the [preProcessor] that is a MetricSetConverter.

        Return: an EvaluatedMetricSet if [paretoAlg] selects only one
        metric set, otherwise a list of EvaluatedMetricSet objects.

        >>> from pareto.paretoDecorators import MinNormSelector
        >>> from pareto import OriginBasePareto
        >>> pAlg = MinNormSelector(OriginBasePareto())

        >>> from metrics.TwoClassMetrics import Accuracy, Sensitivity
        >>> from metrics.metricSet import EvaluatedMetricSet
        >>> met1 = EvaluatedMetricSet.BuildByExplicitValue(
        ...     [(Accuracy, 0.7), (Sensitivity, 0.9)])
        >>> met1.SetTitle("Example1")
        >>> met1.associatedData = range(5)  # property set/get
        >>> met2 = EvaluatedMetricSet.BuildByExplicitValue(
        ...     [(Accuracy, 0.8), (Sensitivity, 0.6)])
        >>> met2.SetTitle("Example2")
        >>> met2.SetAssociatedData("abcdef")  # explicit method call
        >>> met3 = EvaluatedMetricSet.BuildByExplicitValue(
        ...     [(Accuracy, 0.5), (Sensitivity, 0.5)])
        >>> met3.SetTitle("Example3")
        >>> met3.associatedData = float

        >>> from metrics.metricSet.converters import OptDistConverter

        >>> ranker = TrialRanker()  # pAlg selects met1
        >>> best = ranker.SelectBest((met1,met2,met3),
        ...                          pAlg, OptDistConverter())
        >>> best.VerboseDescription(True)
        >>> str(best)
        'Example1: 2 metrics; Accuracy=0.700; Sensitivity=0.900'
        >>> best.associatedData
        [0, 1, 2, 3, 4]

        >>> pAlg = MinNormSelector(OriginBasePareto(), nSelect=2)
        >>> best = ranker.SelectBest((met1,met2,met3),
        ...                          pAlg, OptDistConverter())
        >>> for metSet in best:
        ...     metSet.VerboseDescription(True)
        ...     str(metSet)
        ...     str(metSet.associatedData)
        'Example1: 2 metrics; Accuracy=0.700; Sensitivity=0.900'
        '[0, 1, 2, 3, 4]'
        'Example2: 2 metrics; Accuracy=0.800; Sensitivity=0.600'
        'abcdef'

        >>> from metrics.TwoClassMetrics import PositivePredictiveValue
        >>> met4 = EvaluatedMetricSet.BuildByExplicitValue(
        ...     [(Accuracy, 0.7), (PositivePredictiveValue, 0.5)])
        >>> best = ranker.SelectBest((met1,met2,met3,met4),
        ...                          pAlg, OptDistConverter())
        Traceback (most recent call last):
        ...
        ValueError: Metric sets contain differing Metrics.

From maciek at wojcikowski.pl Tue Oct 11 08:39:07 2016
From: maciek at wojcikowski.pl (Maciek Wójcikowski)
Date: Tue, 11 Oct 2016 14:39:07 +0200
Subject: [scikit-learn] ANN Scikit-learn 0.18 released
References: <40a62931-bf5a-7c59-7253-69418d53f196@gmail.com> <20160929052856.GA1123098@phare.normalesup.org> <77756fd4-5ad9-e51d-51e1-aa60274d2117@gmail.com>

Hi Piotr,

I've been there. Most probably some package is blocking the update through
its numpy dependency. Try to update numpy first and the conflicting package
should pop up:

"conda update numpy=1.11"

----
Best regards,
Maciek Wójcikowski
maciek at wojcikowski.pl

From piotr.bialecki at hotmail.de Tue Oct 11 08:47:28 2016
From: piotr.bialecki at hotmail.de (Piotr Bialecki)
Date: Tue, 11 Oct 2016 12:47:28 +0000
Subject: [scikit-learn] ANN Scikit-learn 0.18 released
References: <40a62931-bf5a-7c59-7253-69418d53f196@gmail.com> <20160929052856.GA1123098@phare.normalesup.org> <77756fd4-5ad9-e51d-51e1-aa60274d2117@gmail.com>

Hi Maciek,

thank you very much!
Numpy and opencv were indeed the conflicting packages.
Apparently my version of opencv was using numpy 1.10, so I uninstalled
opencv, updated numpy and updated scikit-learn to 0.18.

Thanks for the fast help!

Best regards,
Piotr

From olivier.grisel at ensta.org Tue Oct 11 09:44:08 2016
From: olivier.grisel at ensta.org (Olivier Grisel)
Date: Tue, 11 Oct 2016 15:44:08 +0200
Subject: [scikit-learn] HashingVectorizer slow in version 0.18

That's really weird. I don't have a Windows machine handy at the moment.
It would be nice if someone else could confirm.

Could you please run the Python profiler on this to see where the time is
spent on the slow setup?
--
Olivier

From piotr.bialecki at hotmail.de Tue Oct 11 10:03:29 2016
From: piotr.bialecki at hotmail.de (Piotr Bialecki)
Date: Tue, 11 Oct 2016 14:03:29 +0000
Subject: [scikit-learn] HashingVectorizer slow in version 0.18

I just tested it on my Ubuntu machine and could not see any performance
issues (5.68 seconds in scikit-learn 0.17 vs. 6.67 seconds in scikit-learn
0.18).

However, on another Windows 10 machine I could indeed see this issue:

scikit-learn 0.17.1. Numpy 1.11.1. Python 2.7.12 AMD64
Vectorizing 20newsgroup 11314 documents
('Vectorization completed in ', 5.608999967575073, ' seconds, resulting shape ', (11314, 1048576))

scikit-learn 0.18. Numpy 1.11.1. Python 2.7.12 AMD64
Vectorizing 20newsgroup 11314 documents
('Vectorization completed in ', 27.924000024795532, ' seconds, resulting shape ', (11314, 1048576))

From gael.varoquaux at normalesup.org Tue Oct 11 09:49:17 2016
From: gael.varoquaux at normalesup.org (Gael Varoquaux)
Date: Tue, 11 Oct 2016 15:49:17 +0200
Subject: [scikit-learn] HashingVectorizer slow in version 0.18
Message-ID: <20161011134917.GI4179541@phare.normalesup.org>

Could it be a case of compilation? It seems to me that we are compiling
MKL vs non MKL builds.
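
The profiling run requested earlier in this thread could look roughly like
the following. This is only a minimal sketch, assuming Python 3 and the
standard library's cProfile and pstats modules. The helper profile_call and
the stand-in workload tokenize_and_count are hypothetical names introduced
here for illustration; they are not part of scikit-learn or of the benchmark
script posted above.

```python
import cProfile
import io
import pstats


def profile_call(func, *args, top=15, **kwargs):
    """Run func(*args, **kwargs) under cProfile.

    Returns a (result, report) pair, where report is the text of the
    `top` most expensive entries sorted by cumulative time.
    """
    profiler = cProfile.Profile()
    profiler.enable()
    try:
        result = func(*args, **kwargs)
    finally:
        profiler.disable()
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("cumulative").print_stats(top)
    return result, buffer.getvalue()


def tokenize_and_count(documents):
    """Pure-Python stand-in workload (hypothetical): count tokens."""
    counts = {}
    for doc in documents:
        for token in doc.lower().split():
            counts[token] = counts.get(token, 0) + 1
    return counts


if __name__ == "__main__":
    docs = ["the quick brown fox", "the lazy dog"] * 1000
    counts, report = profile_call(tokenize_and_count, docs)
    print(report)
```

To profile the slow case itself, one would presumably pass the vectorizer
call from the benchmark script instead, for example
profile_call(HashingVectorizer().fit_transform, data_train.data), and then
compare the reports produced in the 0.17 and 0.18 environments.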

From mathieu at mblondel.org Tue Oct 11 11:13:30 2016
From: mathieu at mblondel.org (Mathieu Blondel)
Date: Wed, 12 Oct 2016 00:13:30 +0900
Subject: [scikit-learn] HashingVectorizer slow in version 0.18
In-Reply-To: <20161011134917.GI4179541@phare.normalesup.org>
References: <20161011134917.GI4179541@phare.normalesup.org>

On Tue, Oct 11, 2016 at 10:49 PM, Gael Varoquaux
<gael.varoquaux at normalesup.org> wrote:

> Could it be a case of compilation: it seems to me that we are compiling
> MKL vs non MKL builds.

The hashing vectorizer is written in Cython and doesn't use BLAS, though.

Mathieu

From t3kcit at gmail.com Tue Oct 11 14:56:02 2016
From: t3kcit at gmail.com (Andreas Mueller)
Date: Tue, 11 Oct 2016 14:56:02 -0400
Subject: [scikit-learn] HashingVectorizer slow in version 0.18

Please open an issue on the issue tracker:
https://github.com/scikit-learn/scikit-learn/issues

From joel.nothman at gmail.com Wed Oct 12 08:02:30 2016
From: joel.nothman at gmail.com (Joel Nothman)
Date: Wed, 12 Oct 2016 23:02:30 +1100
Subject: [scikit-learn] Doubt regarding issue timeline

If you have a sense that the issue is urgent in some way, then give it up
quickly if you've said you'd do it. Otherwise, it's okay to take a few
weeks. Yes, it would be kind, if it looks like you won't be able to do it,
to say you can't.

Sorry there are no hard rules, but thanks for trying to clarify.

From t3kcit at gmail.com Thu Oct 13 11:36:20 2016
From: t3kcit at gmail.com (Andreas Mueller)
Date: Thu, 13 Oct 2016 11:36:20 -0400
Subject: [scikit-learn] Permission for creating new labels
References: <20161012133118.GB1206164@phare.normalesup.org>
Message-ID: <8645000e-3326-94ec-beee-e0ba6dd028dc@gmail.com>

going to the mailing list

On 10/13/2016 01:35 AM, Raghav R V wrote:
> Thanks for the messages {Ga|Jo}el. ;)
>
> > We can use "needs second review" as an alternative to "MRG+1" but I
> > don't see the point of using both.
>
> I see the system of MRG+1 and MRG+2 as a more robust way of tracking
> approvals to see if the PR can be merged (I'm not sure if review
> approvals completely replace this?) and "Needs 2nd Review" as a quick
> way to search... "Needs 2nd Review" could also be used with MRG PRs
> which have already received a solid review and would need a 2nd look
> from those who don't have much time to do a full fledged review...
>
> > By the way: this discussion should happen on the ML.
>
> Sorry for that. I wasn't sure if this was a very useful/non-trivial
> suggestion and wanted to avoid noise there...
>
> > "Needs triage":
>
> I see that we have "Stale" label for that.

I just added this to make it easier to find PRs to review.
I'm not sure whether it is redundant with the "needs contrib" tag on a PR.
I used "stale" if I was not sure if it's worth working on something and the
author didn't respond for a while.
I haven't used it a lot yet.

I'm ambivalent about adding an "approved by one" tag (which I think is more
explicit than "need one more").
You can search for PRs and issues without comments - I recently did that to
make sure everything had at least one ;)
I'm not sure you can search for the absence of tags.
But I am planning to go through all issues tomorrow to see stuff that I
have missed.
I'll be catching up today with all notifications that I missed this year
because writing my ;)

Maybe having a list of statuses for PRs and issues that covers the common
cases would be good; we just kind of had that discussion, right?
Issues can be bug|enhancement|new feature with status needs contributor,
has PR or needs confirmation/discussion.
It would be nice to see if an issue has a PR; I think there is no way to do
that from the search.
PRs need changes or reviews or are stalled (which is "needs changes" for a
long time and no response) and then might "need contributor".

We could use "needs review" on issues and add a "has PR" tag for issues and
a "one approval" tag for PRs.
I agree with Joel that switching between "needs review" and "needs changes"
in a currently active PR is likely to be cumbersome.

From nelle.varoquaux at gmail.com Thu Oct 13 11:41:29 2016
From: nelle.varoquaux at gmail.com (Nelle Varoquaux)
Date: Thu, 13 Oct 2016 08:41:29 -0700
Subject: [scikit-learn] Permission for creating new labels
In-Reply-To: <8645000e-3326-94ec-beee-e0ba6dd028dc@gmail.com>
References: <20161012133118.GB1206164@phare.normalesup.org> <8645000e-3326-94ec-beee-e0ba6dd028dc@gmail.com>

On 13 October 2016 at 08:36, Andreas Mueller wrote:
> going to the mailing list
>
> On 10/13/2016 01:35 AM, Raghav R V wrote:
>
> Thanks for the messages {Ga|Jo}el. ;)
>
>> We can use "needs second review" as an alternative to "MRG+1" but I don't
>> see the point of using both.
>
> I see the system of MRG+1 and MRG+2 as a more robust way of tracking
> approvals to see if the PR can be merged (I'm not sure if review approvals
> completely replace this?) and "Needs 2nd Review" as a quick way to search...
> "Needs 2nd Review" could also be used with MRG PRs which have already
> received a solid review and would need a 2nd look from those who don't have
> much time to do a full fledged review...
>
>> By the way: this discussion should happen on the ML.
>
> Sorry for that. I wasn't sure if this was a very useful/non-trivial
> suggestion and wanted to avoid noise there...
>
>> "Needs triage":
>
> I see that we have "Stale" label for that.
>
> I just added this to make it easier to find PRs to review.
> I'm not sure if it is not redundant with the "needs contrib" tag on a PR.
> I used "stale" if I was not sure if it's worth working on something and the
> author didn't respond for a while.
> I haven't used it a lot yet.
>
> I'm ambivalent about adding a "approved by one" (which I think is more
> explicit than "need one more") tag.
> > You can search for PRs and issues without comments - I recently did that to > make sure everything had at least one ;) > I'm not sure you can search for the absence of tags. But I am planning to go > through all issues tomorrow to see stuff > that I have missed. I'll be catching up today with all notifications that I > missed this year because writing my ;) > > Maybe having an list of statuses for PRs and issues that covers the common > cases would be good, we just kind of had that discussion, right? > > Issues can be bug|enhancement|new feature with status needs contributor, has > PR or needs confirmation/discussion. It would be nice to see > if a issue has a PR, I think there is no way to do that from the search. > > PRs need changes or reviews or are stalled (which is "needs changes" for a > long time and no response) and then might "need contributor". > > > We could use "needs review" on issues and add a "has PR" tag for issues and > a "one approval" tag for PRs. > > I agree with Joel that switching between "needs review" and "needs changes" > in a currently active PR is likely to be cumbersome. >From my experience on matplotlib that has such a system, it is a not a very good idea? Reviewers rarely change the tag to needs change, and when they do, reviewers ignore it and continue reviewing it (which is slightly annoying in some cases). > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > From stuart at stuartreynolds.net Thu Oct 13 14:14:17 2016 From: stuart at stuartreynolds.net (Stuart Reynolds) Date: Thu, 13 Oct 2016 11:14:17 -0700 Subject: [scikit-learn] Missing data and decision trees Message-ID: I'm looking for a decision tree and RF implementation that supports missing data (without imputation) -- ideally in Python, Java/Scala or C++. 
It seems that scikit's decision tree algorithm doesn't allow this -- which is disappointing because its one of the few methods that should be able to sensibly handle problems with high amounts of missingness. Are there plans to allow missing data in scikit's decision trees? Also, is there any particular reason why missing values weren't supported originally (e.g. integrates poorly with other features) Regards - Stuart -------------- next part -------------- An HTML attachment was scrubbed... URL: From jmschreiber91 at gmail.com Thu Oct 13 14:20:34 2016 From: jmschreiber91 at gmail.com (Jacob Schreiber) Date: Thu, 13 Oct 2016 11:20:34 -0700 Subject: [scikit-learn] Missing data and decision trees In-Reply-To: References: Message-ID: I think Raghav is working on it in this PR: https://github.com/scikit-learn/scikit-learn/pull/5974 The reason they weren't initially supported is likely that it involves a lot of work and design choices to handle missing values appropriately, and the discussion on the best way to handle it was postponed until there was a working estimator which could serve most peoples needs. On Thu, Oct 13, 2016 at 11:14 AM, Stuart Reynolds wrote: > I'm looking for a decision tree and RF implementation that supports > missing data (without imputation) -- ideally in Python, Java/Scala or C++. > > It seems that scikit's decision tree algorithm doesn't allow this -- > which is disappointing because its one of the few methods that should be > able to sensibly handle problems with high amounts of missingness. > > Are there plans to allow missing data in scikit's decision trees? > > Also, is there any particular reason why missing values weren't supported > originally (e.g. 
integrates poorly with other features) > > Regards > - Stuart > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From jeffrey.m.allard at gmail.com Thu Oct 13 14:20:40 2016 From: jeffrey.m.allard at gmail.com (Jeff) Date: Thu, 13 Oct 2016 14:20:40 -0400 Subject: [scikit-learn] Missing data and decision trees In-Reply-To: References: Message-ID: <7c733120-4f1f-59d4-a14a-3fb15c960598@gmail.com> I ran into this several times as well with scikit-learn implementation of GBM. Look at xgboost if you have not already (is there someone out there that hasn't ? :)- it deals with missing values in the predictor space in a very eloquent manner. http://xgboost.readthedocs.io/en/latest/python/python_intro.html https://arxiv.org/abs/1603.02754 Jeff On 10/13/2016 2:14 PM, Stuart Reynolds wrote: > I'm looking for a decision tree and RF implementation that supports > missing data (without imputation) -- ideally in Python, Java/Scala or > C++. > > It seems that scikit's decision tree algorithm doesn't allow this -- > which is disappointing because its one of the few methods that should > be able to sensibly handle problems with high amounts of missingness. > > Are there plans to allow missing data in scikit's decision trees? > > Also, is there any particular reason why missing values weren't > supported originally (e.g. integrates poorly with other features) > > Regards > - Stuart > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From jcrudy at gmail.com Thu Oct 13 14:28:02 2016 From: jcrudy at gmail.com (Jason Rudy) Date: Thu, 13 Oct 2016 11:28:02 -0700 Subject: [scikit-learn] Missing data and decision trees In-Reply-To: <7c733120-4f1f-59d4-a14a-3fb15c960598@gmail.com> References: <7c733120-4f1f-59d4-a14a-3fb15c960598@gmail.com> Message-ID: It's not a decision tree, but py-earth may also do what you need. It handles missingness as described in section 3.4 here: http://media.salford-systems.com/library/MARS_V2_JHF_LCS-108.pdf. Basically, missingness is considered potentially predictive. On Thu, Oct 13, 2016 at 11:20 AM, Jeff wrote: > I ran into this several times as well with scikit-learn implementation of > GBM. Look at xgboost if you have not already (is there someone out there > that hasn't ? :)- it deals with missing values in the predictor space in a > very eloquent manner. > > http://xgboost.readthedocs.io/en/latest/python/python_intro.html > > https://arxiv.org/abs/1603.02754 > > > Jeff > > > > On 10/13/2016 2:14 PM, Stuart Reynolds wrote: > > I'm looking for a decision tree and RF implementation that supports > missing data (without imputation) -- ideally in Python, Java/Scala or C++. > > It seems that scikit's decision tree algorithm doesn't allow this -- > which is disappointing because its one of the few methods that should be > able to sensibly handle problems with high amounts of missingness. > > Are there plans to allow missing data in scikit's decision trees? > > Also, is there any particular reason why missing values weren't supported > originally (e.g. 
integrates poorly with other features) > > Regards > - Stuart > > > _______________________________________________ > scikit-learn mailing listscikit-learn at python.orghttps://mail.python.org/mailman/listinfo/scikit-learn > > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From drraph at gmail.com Thu Oct 13 14:33:20 2016 From: drraph at gmail.com (Raphael C) Date: Thu, 13 Oct 2016 19:33:20 +0100 Subject: [scikit-learn] Missing data and decision trees In-Reply-To: References: Message-ID: You can simply make a new binary feature (per feature that might have a missing value) that is 1 if the value is missing and 0 otherwise. The RF can then work out what to do with this information. I don't know how this compares in practice to more sophisticated approaches. Raphael On Thursday, October 13, 2016, Stuart Reynolds wrote: > I'm looking for a decision tree and RF implementation that supports > missing data (without imputation) -- ideally in Python, Java/Scala or C++. > > It seems that scikit's decision tree algorithm doesn't allow this -- > which is disappointing because its one of the few methods that should be > able to sensibly handle problems with high amounts of missingness. > > Are there plans to allow missing data in scikit's decision trees? > > Also, is there any particular reason why missing values weren't supported > originally (e.g. integrates poorly with other features) > > Regards > - Stuart > -------------- next part -------------- An HTML attachment was scrubbed... URL: From Dale.T.Smith at macys.com Thu Oct 13 14:21:00 2016 From: Dale.T.Smith at macys.com (Dale T Smith) Date: Thu, 13 Oct 2016 18:21:00 +0000 Subject: [scikit-learn] Missing data and decision trees In-Reply-To: References: Message-ID: Please define ?sensibly?. 
I would be strongly opposed to modifying any models to incorporate "missingness". No model handles missing data for you. That is for you to decide based on your individual problem domain.

Take a look at a talk from last winter on missing data by Nina Zumel. Nina defines "sensibly" in several ways.

https://www.r-bloggers.com/prepping-data-for-analysis-using-r/

__________________________________________________________________________________________
Dale Smith | Macy's Systems and Technology | IFS eCommerce | Data Science
770-658-5176 | 5985 State Bridge Road, Johns Creek, GA 30097 | dale.t.smith at macys.com

From: scikit-learn [mailto:scikit-learn-bounces+dale.t.smith=macys.com at python.org] On Behalf Of Stuart Reynolds
Sent: Thursday, October 13, 2016 2:14 PM
To: scikit-learn at python.org
Subject: [scikit-learn] Missing data and decision trees

I'm looking for a decision tree and RF implementation that supports missing data (without imputation) -- ideally in Python, Java/Scala or C++.

It seems that scikit's decision tree algorithm doesn't allow this -- which is disappointing because its one of the few methods that should be able to sensibly handle problems with high amounts of missingness.

Are there plans to allow missing data in scikit's decision trees?

Also, is there any particular reason why missing values weren't supported originally (e.g. integrates poorly with other features)

Regards
- Stuart

-------------- next part --------------
An HTML attachment was scrubbed...
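Raphael's suggestion earlier in this thread (one binary "is missing" column per feature that can be missing) can be sketched as follows. This is only an illustration under stated assumptions, not code from any message: the toy arrays, the -1.0 placeholder and the helper name add_missing_indicators are hypothetical, and the placeholder fill is there only because the tree estimators discussed here reject NaN input.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier


def add_missing_indicators(X, fill_value=-1.0):
    """Hypothetical helper: fill NaNs and append one 0/1 column per feature with NaNs."""
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)                      # boolean mask of missing cells
    cols_with_missing = missing.any(axis=0)    # features that have at least one NaN
    filled = np.where(missing, fill_value, X)  # placeholder value (an assumption)
    indicators = missing[:, cols_with_missing].astype(float)
    return np.hstack([filled, indicators])


# Toy data: two features, each with one missing entry.
X = np.array([[1.0, np.nan],
              [2.0, 0.5],
              [np.nan, 1.5],
              [4.0, 2.5]])
y = np.array([0, 0, 1, 1])

X_aug = add_missing_indicators(X)
clf = RandomForestClassifier(n_estimators=10, random_state=0).fit(X_aug, y)
print(X_aug.shape)
```

In real use the set of indicator columns would have to be fixed from the training data and reused at prediction time; the helper above recomputes it from whatever array it is given.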
URL:

From ragvrv at gmail.com Thu Oct 13 16:17:25 2016
From: ragvrv at gmail.com (Raghav R V)
Date: Thu, 13 Oct 2016 22:17:25 +0200
Subject: [scikit-learn] Missing data and decision trees
In-Reply-To:
References:
Message-ID:

Hi Stuart Reynolds,

Like Jacob said, we have an active PR at https://github.com/scikit-learn/scikit-learn/pull/5974

You could do

git fetch https://github.com/raghavrv/scikit-learn.git missing_values_rf:missing_values_rf
git checkout missing_values_rf
python setup.py install

and try it out. I warn you though, there are some memory leaks I'm trying to debug. But for the most part it works well and outperforms basic imputation techniques.

Please let us know if it breaks or does not solve your use case. Your input as a user of that feature would be invaluable!

> I ran into this several times as well with scikit-learn implementation of GBM. Look at xgboost if you have not already (is there someone out there that hasn't ? :)- it deals with missing values in the predictor space in a very eloquent manner. http://xgboost.readthedocs.io/en/latest/python/python_intro.html

The PR handles it with a conceptually similar approach. It is currently implemented for DecisionTreeClassifier. After reviews and integration, DecisionTreeRegressor would also support missing values. Once that happens, enabling it in gradient boosting will be possible.

Thanks for the interest!!

On Thu, Oct 13, 2016 at 8:33 PM, Raphael C wrote:
> You can simply make a new binary feature (per feature that might have a
> missing value) that is 1 if the value is missing and 0 otherwise. The RF
> can then work out what to do with this information.
>
> I don't know how this compares in practice to more sophisticated
> approaches.
>
> Raphael
>
> On Thursday, October 13, 2016, Stuart Reynolds wrote:
>
>> I'm looking for a decision tree and RF implementation that supports
>> missing data (without imputation) -- ideally in Python, Java/Scala or C++.
>>
>> It seems that scikit's decision tree algorithm doesn't allow this --
>> which is disappointing because its one of the few methods that should be
>> able to sensibly handle problems with high amounts of missingness.
>>
>> Are there plans to allow missing data in scikit's decision trees?
>>
>> Also, is there any particular reason why missing values weren't supported
>> originally (e.g. integrates poorly with other features)
>>
>> Regards
>> - Stuart

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From anael.bonneton at gmail.com Fri Oct 14 09:27:01 2016
From: anael.bonneton at gmail.com (Anaël Bonneton)
Date: Fri, 14 Oct 2016 15:27:01 +0200
Subject: [scikit-learn] Silhouette example - performance issue
Message-ID:

Hi,

In the silhouette example (http://scikit-learn.org/stable/auto_examples/cluster/plot_kmeans_silhouette_analysis.html#sphx-glr-auto-examples-cluster-plot-kmeans-silhouette-analysis-py), the silhouette value of each sample is computed twice: once with *silhouette_score* and once with *silhouette_samples*. The call to *silhouette_score* can easily be avoided by computing the average of the result of *silhouette_samples*.

Do you think we should remove the call to *silhouette_score* to improve performance? Or is it better to keep the two functions to show how to use them?

Anaël Bonneton
-------------- next part --------------
An HTML attachment was scrubbed...
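The observation above, that the average of silhouette_samples gives the same number as silhouette_score, can be checked with a small sketch. The blob data and the KMeans settings here are arbitrary choices made for illustration; they are not taken from the example script being discussed.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_samples, silhouette_score

# Arbitrary toy data and clustering, only to have labels to score.
X, _ = make_blobs(n_samples=300, centers=4, random_state=0)
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)

# Per-sample silhouette values, computed once.
sample_values = silhouette_samples(X, labels)

# Average derived from the per-sample values, without a second pass.
avg_from_samples = sample_values.mean()

# The same quantity from the dedicated function (recomputes everything).
avg_from_score = silhouette_score(X, labels)

print(np.isclose(avg_from_samples, avg_from_score))
```

This equivalence holds when silhouette_score is called without its sample_size argument; with subsampling the two values would generally differ.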
URL: From ragvrv at gmail.com Fri Oct 14 09:38:55 2016 From: ragvrv at gmail.com (Raghav R V) Date: Fri, 14 Oct 2016 15:38:55 +0200 Subject: [scikit-learn] Silhouette example - performance issue In-Reply-To: References: Message-ID: On Fri, Oct 14, 2016 at 3:27 PM, Ana?l Bonneton wrote: > Hi, > > In the silhouette example (http://scikit-learn.org/ > stable/auto_examples/cluster/plot_kmeans_silhouette_ > analysis.html#sphx-glr-auto-examples-cluster-plot-kmeans- > silhouette-analysis-py), the silhouette values of each sample is computed > twice: once with *silhouette_score *and once with *silhouette_samples.* > The call to *silhouette_score* can be easily avoided by computing the > average of the result of* silhouette_samples*. > > Do you think we should remove the call to *silhouette_score* to improve > the performance ? Or it is better to keep the two functions to show how to > use them ? > Hi, When I wrote it, I intended it to be demonstrative of the two methods. Not sure if we should worry about performance issues there -- Raghav RV https://github.com/raghavrv -------------- next part -------------- An HTML attachment was scrubbed... URL: From michael.eickenberg at gmail.com Fri Oct 14 09:55:25 2016 From: michael.eickenberg at gmail.com (Michael Eickenberg) Date: Fri, 14 Oct 2016 15:55:25 +0200 Subject: [scikit-learn] Silhouette example - performance issue In-Reply-To: References: Message-ID: Dear Ana?l, if you wish, you could add a line to the example verifying this correspondence. E.g. by moving the print function from between the two silhouette evaluations to after and also evaluating that average and printing it in parentheses. Probably not necessary though. A comment would do also. Or nothing :) Michael On Fri, Oct 14, 2016 at 3:38 PM, Raghav R V wrote: > On Fri, Oct 14, 2016 at 3:27 PM, Ana?l Bonneton > wrote: > >> Hi, >> >> In the silhouette example (http://scikit-learn.org/stabl >> e/auto_examples/cluster/plot_kmeans_silhouette_analysis. 
>> html#sphx-glr-auto-examples-cluster-plot-kmeans-silhouette-analysis-py),
>> the silhouette values of each sample is computed twice: once with *silhouette_score
>> *and once with *silhouette_samples.* The call to *silhouette_score* can
>> be easily avoided by computing the average of the result of*
>> silhouette_samples*.
>>
>> Do you think we should remove the call to *silhouette_score* to improve
>> the performance ? Or it is better to keep the two functions to show how to
>> use them ?
>>
> Hi,
>
> When I wrote it, I intended it to be demonstrative of the two methods.
>
> Not sure if we should worry about performance issues there
>
> --
> Raghav RV
> https://github.com/raghavrv

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From tom.to.the.k at gmail.com Mon Oct 17 07:36:32 2016
From: tom.to.the.k at gmail.com (Tomas Karasek)
Date: Mon, 17 Oct 2016 14:36:32 +0300
Subject: [scikit-learn] Search in results for optimal parameter subsets
Message-ID:

Hey,

I have a dataframe with test results (10k rows). Each result (row) has ~6 parameters, plus some output metrics. I would like to find combinations of the parameters which have a reasonable mean, std and support-count (number of results in the configuration).

E.g. if there are parameters "k" and "n", each in range(100), and the result metric has good mean and std for "k in [4..12] and c in [90..95]" (support-count for this would be 8*5 = 40) and then maybe "k in [34..41] and c in [10..13]" (support-count is 7*3 = 21), then I would like to have the algorithm return something like

k        c        mean   std    support-count   total_score
4..12    90..95   12.1   1.23   40              9.3
34..41   10..13   11.1   1.13   21              6.2

I understand I will first have to define a function that will reduce the mean, std and count to the total_score. I can do that somehow.
But I don't know what kind of math task is finding the local maxima of parameter configuration subsets. Is this optimization task? Can you please point me to sth in sklearn or scipy, that would give me some direction? Cheers, Tomas From joel.nothman at gmail.com Tue Oct 18 05:36:38 2016 From: joel.nothman at gmail.com (Joel Nothman) Date: Tue, 18 Oct 2016 20:36:38 +1100 Subject: [scikit-learn] Silhouette example - performance issue In-Reply-To: References: Message-ID: And we can reduce any substantial performance issues by merging https://github.com/scikit-learn/scikit-learn/pull/7177 ... :) On 15 October 2016 at 00:55, Michael Eickenberg < michael.eickenberg at gmail.com> wrote: > Dear Ana?l, > > if you wish, you could add a line to the example verifying this > correspondence. E.g. by moving the print function from between the two > silhouette evaluations to after and also evaluating that average and > printing it in parentheses. > > Probably not necessary though. A comment would do also. Or nothing :) > > Michael > > > On Fri, Oct 14, 2016 at 3:38 PM, Raghav R V wrote: > >> On Fri, Oct 14, 2016 at 3:27 PM, Ana?l Bonneton > > wrote: >> >>> Hi, >>> >>> In the silhouette example (http://scikit-learn.org/stabl >>> e/auto_examples/cluster/plot_kmeans_silhouette_analysis.html >>> #sphx-glr-auto-examples-cluster-plot-kmeans-silhouette-analysis-py), >>> the silhouette values of each sample is computed twice: once with *silhouette_score >>> *and once with *silhouette_samples.* The call to *silhouette_score* can >>> be easily avoided by computing the average of the result of* >>> silhouette_samples*. >>> >>> Do you think we should remove the call to *silhouette_score* to improve >>> the performance ? Or it is better to keep the two functions to show how to >>> use them ? >>> >> Hi, >> >> When I wrote it, I intended it to be demonstrative of the two methods. 
>>
>> Not sure if we should worry about performance issues there
>>
>> --
>> Raghav RV
>> https://github.com/raghavrv

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From joel.nothman at gmail.com Wed Oct 19 10:42:31 2016
From: joel.nothman at gmail.com (Joel Nothman)
Date: Thu, 20 Oct 2016 01:42:31 +1100
Subject: [scikit-learn] Towards 0.18.1
Message-ID:

Due to a few substantial bugs in 0.18.0, we're hoping to release 0.18.1 around the end of the month. Help solving (and reviewing) the issues listed at https://github.com/scikit-learn/scikit-learn/milestone/22 is welcome. In particular, an easy documentation issue at https://github.com/scikit-learn/scikit-learn/pull/7659 is waiting to be picked up.
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From brookm291 at gmail.com Sat Oct 22 13:32:04 2016
From: brookm291 at gmail.com (KevNo)
Date: Sun, 23 Oct 2016 02:32:04 +0900
Subject: [scikit-learn] Recurrent States with Decision Tree
Message-ID: <580BA294.30406@gmail.com>

Hello,

Just wondering, how can we set up path-dependent input states for Random Forest/Decision Tree?

This is similar to a Recurrent Network, where the input Xt = (x0, ..., xi, ..., yt-1, yt-2) depends on past output states Yt.

If we put in the exact values of the states Yt, it obviously creates a bias in the training. So we should put in some estimate of Yt.

Does the concept of a Recurrent Tree make sense?

Thanks for your insight.
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
-------------- next part --------------
A non-text attachment was scrubbed...
Name: compose-unknown-contact.jpg
Type: image/jpeg
Size: 770 bytes
Desc: not available
URL:

From rdslater at gmail.com Sun Oct 23 18:37:30 2016
From: rdslater at gmail.com (Robert Slater)
Date: Sun, 23 Oct 2016 17:37:30 -0500
Subject: [scikit-learn] Random Forest with Mean Absolute Error
Message-ID:

I searched the archives to see if this was a known issue, but could not find anyone else having the problem.

Nevertheless, in the latest version (0.18) RandomForestRegressor has the new option of 'mae' for criterion. However, it appears to run disproportionately longer than the 'mse' criterion.

For example:

from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

# Random forest using the mean absolute error criterion (the slow case)
rf_tree = 50
rf_depth = 5
rf = RandomForestRegressor(n_estimators=rf_tree, criterion='mae',
                           max_depth=rf_depth,
                           min_samples_split=4, min_samples_leaf=2,
                           max_features=0.5,
                           max_leaf_nodes=5,
                           oob_score=True, n_jobs=1, random_state=0,
                           verbose=1)

# Extra-trees model with the default criterion, for comparison
et_tree = 100
et = ExtraTreesRegressor(n_estimators=et_tree, max_depth=5, min_samples_split=4,
                         min_samples_leaf=2, max_features=0.5, verbose=1, n_jobs=4)

# train (features) and loss (target) come from the competition data, not shown here
X_train, X_test, y_train, y_test = train_test_split(train, loss,
                                                    test_size=0.2, random_state=42)

et.fit(X_train, y_train)
rf.fit(X_train, y_train)

rf_pred = rf.predict(X_test)
et_pred = et.predict(X_test)

print(mean_absolute_error(y_test, rf_pred))
print(mean_absolute_error(y_test, et_pred))

I was using these two for a recent Kaggle competition. If I use "criterion='mse'" in the random forest, it takes around 1 minute to build. Switching to 'mae' causes 100% CPU usage and at least 30 minutes of wait time before I kill my kernel.

I am not sure if the problem is on my end or if there is a real issue, so I wanted to reach out and see if there are others.
-------------- next part --------------
An HTML attachment was scrubbed...
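One plausible way to see why an absolute-error criterion costs more per split than squared error is the toy comparison below, written in plain NumPy. It is a hedged illustration only and not scikit-learn's tree code: the function names and data are made up, and the absolute-error version is deliberately naive. The point is that the squared-error cost of every candidate split follows from running sums, while the absolute-error cost needs a median on each side of each split.

```python
import numpy as np


def best_split_squared_error(y):
    """Best split position using running sums (no per-split rescan of the data)."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    k = np.arange(1, n)                 # size of the left part for each split
    s1 = np.cumsum(y)[:-1]              # running sum of the left part
    s2 = np.cumsum(y ** 2)[:-1]         # running sum of squares of the left part
    total1, total2 = y.sum(), (y ** 2).sum()
    left = s2 - s1 ** 2 / k             # sum of squared deviations, left side
    right = (total2 - s2) - (total1 - s1) ** 2 / (n - k)
    return int(np.argmin(left + right)) + 1


def best_split_absolute_error(y):
    """Naive best split position: recompute a median on both sides of every split."""
    y = np.asarray(y, dtype=float)
    costs = []
    for k in range(1, len(y)):
        left, right = y[:k], y[k:]
        costs.append(np.abs(left - np.median(left)).sum()
                     + np.abs(right - np.median(right)).sum())
    return int(np.argmin(costs)) + 1


# Targets already ordered by a (hypothetical) feature value.
y = [1.0, 1.2, 0.9, 5.0, 5.3, 4.8]
split_sq = best_split_squared_error(y)
split_abs = best_split_absolute_error(y)
print(split_sq, split_abs)
```

On this toy input both criteria pick the same split position; what differs is the amount of work needed to get there, which grows much faster for the median-based version as the number of samples increases.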
URL: From nfliu at uw.edu Sun Oct 23 18:45:09 2016 From: nfliu at uw.edu (Nelson Liu) Date: Sun, 23 Oct 2016 15:45:09 -0700 Subject: [scikit-learn] Random Forest with Mean Absolute Error In-Reply-To: References: Message-ID: Hi Robert, Thanks for the report. This is definitely not something just on your end; MAE does run longer than MSE, especially on larger datasets, due to the need to find the median of data for MAE (expensive) vs the mean of data for MSE (not so expensive). We've used a variety of tricks to try to make it faster for growing trees, but it still seems like it is quite slow for these larger datasets. I've been working on a patch to speed it up by using a binary mask to further reduce the amount of computation MAE needs per split, but I've been bogged down with real life recently and haven't had a chance to wrap it up. Nelson Liu On Sun, Oct 23, 2016 at 3:37 PM, Robert Slater wrote: > I searched the archives to see if this was a known issue, but could not > seem to find anyone else having the problem. > > Nevertheless, in the latest version (0.18) Random Forest Regressor has the > new option of 'mae' for criterion. 
However it appears to run > disporportinally longer than the 'mse' critera, > > For example: > > from sklearn.ensemble import RandomForestRegressor > rf_tree=50 > rf_depth=5 > rf=RandomForestRegressor(n_estimators=rf_tree, criterion='mae', > max_depth=rf_depth, > min_samples_split=4, min_samples_leaf=2, > max_features=0.5, > max_leaf_nodes=5, > oob_score=True, n_jobs=1, random_state=0, > verbose=1) > > from sklearn.ensemble import ExtraTreesRegressor > et_tree=100 > et=ExtraTreesRegressor(n_estimators=et_tree,max_depth=5,min_samples_split=4, > min_samples_leaf=2,max_features=0.5,verbose=1,n_jobs=4) > > from sklearn.model_selection import train_test_split > from sklearn.metrics import mean_absolute_error > X_train, X_test, y_train, y_test = train_test_split(train, loss, > test_size=0.2, random_state=42) > > et.fit(X_train,y_train) > rf.fit(X_train,y_train) > > rf_pred=rf.predict(X_test) > et_pred=et.predict(X_test) > > print(mean_absolute_error(y_test,rf_pred)) > print(mean_absolute_error(y_test,et_pred)) > > I was using these two for a recent Kaggle competition. If I use > "criterion='mse'" in the Random forest it takes around 1 min to build. > Switching to 'mae' causes 100% CPU usage and 30 minutes (at least) if wait > time before I kill my kernel. > > Not sure if the problem is on my end or if there is a real issue so I > wanted to reach out and see if there or others. > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From aakash at klugtek.co.in Mon Oct 24 09:53:02 2016 From: aakash at klugtek.co.in (Aakash Agarwal) Date: Mon, 24 Oct 2016 19:23:02 +0530 Subject: [scikit-learn] NER Tagged Data Message-ID: Hi All. I am trying to implement NER Algo using CRF data. Can anyone point me to some tagged data which i used. 
I am not able to use the CoNLL (2003) data; I got the tagged data, but the words are missing. I downloaded the RCV1 dataset but still could not generate training and testing data. I would be grateful if anybody can help me.

Thanks in advance!
Aakash

From greg315 at hotmail.fr Mon Oct 24 10:18:11 2016
From: greg315 at hotmail.fr (greg g)
Date: Mon, 24 Oct 2016 14:18:11 +0000
Subject: [scikit-learn] tree visualization with class names in leaves
Message-ID:

Hi,
I just begin with scikit-learn and would like to visualize a classification tree with class names displayed in the leaves as shown in the SCIKITLEARN.TREE documentation http://scikit-learn.org/stable/modules/tree.html where we find class='virginica' etc...
I made a tree providing a 2D array X (n1 samples, n2 features) and 1D array Y (n1 corresponding classes) such that Y(i) is the class of the sample X(i, ...)
After that I have correct predictions using predict()
Then I use the function
    export_graphviz(clf, out_file=dot_data,feature_names=FEATURES)
with FEATURES being the array of my n2 features names in the same order as in X
I obtain the tree .png but can't find a way to have the correct class names in the leaves...
In export_graphviz() should I use the class_names optional parameter and how ?
Thanks for any help

Gregory, Toulouse FRANCE

From se.raschka at gmail.com Mon Oct 24 11:47:23 2016
From: se.raschka at gmail.com (Sebastian Raschka)
Date: Mon, 24 Oct 2016 11:47:23 -0400
Subject: [scikit-learn] tree visualization with class names in leaves
In-Reply-To:
References:
Message-ID:

Hi, Greg,
if you provide the class_names argument, a 'class' label of the majority class will be added at the bottom of each node. For instance, if you have the Iris dataset, with class labels 0, 1, 2, you can provide the class_names as ['setosa', 'versicolor', 'virginica'], where 0 -> 'setosa', 1 -> 'versicolor', 2 -> 'virginica'.
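
A minimal sketch of the class_names usage described above, assuming the Iris data and a depth-2 DecisionTreeClassifier purely for illustration (the names clf and dot_source are not from the thread). The names are passed in the ascending order of clf.classes_, and out_file=None asks export_graphviz to return the DOT source as a string:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_graphviz

iris = load_iris()

# A shallow tree keeps the exported graph small (illustrative choice).
clf = DecisionTreeClassifier(max_depth=2, random_state=0)
clf.fit(iris.data, iris.target)

# class_names[i] labels clf.classes_[i]; here classes_ is [0, 1, 2].
dot_source = export_graphviz(
    clf,
    out_file=None,                           # return the DOT source as a string
    feature_names=list(iris.feature_names),
    class_names=list(iris.target_names),     # ['setosa', 'versicolor', 'virginica']
)
print(dot_source)
```

Each node label in the printed DOT source should then carry a "class = ..." line naming the majority class at that node.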

Best,
Sebastian

> On Oct 24, 2016, at 10:18 AM, greg g wrote:
>
> Hi,
> I just begin with scikit-learn and would like to visualize a classification tree with class names displayed in the leaves as shown in the SCIKITLEARN.TREE documentation http://scikit-learn.org/stable/modules/tree.html where we find class='virginica' etc...
> I made a tree providing a 2D array X (n1 samples, n2 features) and 1D array Y (n1 corresponding classes) such that Y(i) is the class of the sample X(i, ...)
> After that I have correct predictions using predict()
> Then I use the function
>     export_graphviz(clf, out_file=dot_data,feature_names=FEATURES)
> with FEATURES being the array of my n2 features names in the same order as in X
> I obtain the tree .png but can't find a way to have the correct class names in the leaves...
> In export_graphviz() should I use the class_names optional parameter and how ?
> Thanks for any help
>
> Gregory, Toulouse FRANCE

From blrstartuphire at gmail.com Tue Oct 25 02:41:08 2016
From: blrstartuphire at gmail.com (Startup Hire)
Date: Tue, 25 Oct 2016 12:11:08 +0530
Subject: [scikit-learn] tree visualization with class names in leaves
In-Reply-To:
References:
Message-ID:

Hi all,

Thanks for the suggestion.

I have a related question on tree visualization.

I have 2 classes to predict: 0 and 1 (it comes up as a numeric field when I load the dataset).

I have given the class_names as "NotPresent" and "Ispresent", which I believe will map to 0 and 1. Is that correct?

How do I interpret the nodes and the value present in each node in the accompanying diagram?

Regards,
Sanant

On Mon, Oct 24, 2016 at 9:17 PM, Sebastian Raschka wrote:
> Hi, Greg,
> if you provide the class_names argument, a 'class' label of the majority
> class will be added at the bottom of each node. For instance, if you have
> the Iris dataset, with class labels 0, 1, 2, you can provide the
> class_names as ['setosa', 'versicolor', 'virginica'], where 0 ->
> 'setosa', 1 -> 'versicolor', 2 -> 'virginica'.
>
> Best,
> Sebastian
>
> > On Oct 24, 2016, at 10:18 AM, greg g wrote:
> >
> > Hi,
> > I just begin with scikit-learn and would like to visualize a classification tree with class names displayed in the leaves as shown in the SCIKITLEARN.TREE documentation http://scikit-learn.org/stable/modules/tree.html where we find class='virginica' etc...
> > I made a tree providing a 2D array X (n1 samples, n2 features) and 1D array Y (n1 corresponding classes) such that Y(i) is the class of the sample X(i, ...)
> > After that I have correct predictions using predict()
> > Then I use the function
> >     export_graphviz(clf, out_file=dot_data,feature_names=FEATURES)
> > with FEATURES being the array of my n2 features names in the same order as in X
> > I obtain the tree .png but can't find a way to have the correct class names in the leaves...
> > In export_graphviz() should I use the class_names optional parameter and how ?
> > Thanks for any help
> >
> > Gregory, Toulouse FRANCE

-------------- next part --------------
A non-text attachment was scrubbed...
Name: Extra Decision tree.png
Type: image/png
Size: 145232 bytes
Desc: not available

From greg315 at hotmail.fr Tue Oct 25 03:00:09 2016
From: greg315 at hotmail.fr (greg g)
Date: Tue, 25 Oct 2016 07:00:09 +0000
Subject: [scikit-learn] tree visualization with class names in leaves
In-Reply-To:
References:
Message-ID:

Hi Sebastian,
Thanks for your answer.
I don't use the iris dataset. My classes are distributed in my Y array.
It seems that I can get the classes in alphabetical order with
    clf.classes_
where clf is my tree.
And with
    export_graphviz(clf, out_file=dot_data,feature_names=FEATURES,class_names=clf.classes_)
the nodes of the graphical tree seem to be filled with the predominant class and the distribution of samples in a vector with the classes in alphabetical order (the same order as in clf.classes_).
I have to confirm that with more classes.

Regards
Gregory

________________________________
From: scikit-learn on behalf of Sebastian Raschka
Sent: Monday, 24 October 2016 17:47
To: Scikit-learn user and developer mailing list
Subject: Re: [scikit-learn] tree visualization with class names in leaves

Hi, Greg,
if you provide the class_names argument, a 'class' label of the majority class will be added at the bottom of each node.
For instance, if you have the Iris dataset, with class labels 0, 1, 2, you can provide the class_names as ['setosa', 'versicolor', 'virginica'], where 0 -> 'setosa', 1 -> 'versicolor', 2 -> 'virginica'.

Best,
Sebastian

> On Oct 24, 2016, at 10:18 AM, greg g wrote:
>
> Hi,
> I just begin with scikit-learn and would like to visualize a classification tree with class names displayed in the leaves as shown in the SCIKITLEARN.TREE documentation http://scikit-learn.org/stable/modules/tree.html where we find class='virginica' etc...
> I made a tree providing a 2D array X (n1 samples, n2 features) and 1D array Y (n1 corresponding classes) such that Y(i) is the class of the sample X(i, ...)
> After that I have correct predictions using predict()
> Then I use the function
>     export_graphviz(clf, out_file=dot_data,feature_names=FEATURES)
> with FEATURES being the array of my n2 features names in the same order as in X
> I obtain the tree .png but can't find a way to have the correct class names in the leaves...
> In export_graphviz() should I use the class_names optional parameter and how ?
> Thanks for any help
>
> Gregory, Toulouse FRANCE

From piotr.bialecki at hotmail.de Tue Oct 25 05:32:21 2016
From: piotr.bialecki at hotmail.de (Piotr Bialecki)
Date: Tue, 25 Oct 2016 09:32:21 +0000
Subject: [scikit-learn] tree visualization with class names in leaves
In-Reply-To:
References:
Message-ID:

Hi Sanant,

the values represent the thresholds at the current feature (node), which are used to classify the next sample.

You can see an example here:
http://scikit-learn.org/stable/modules/tree.html

The first node uses the feature "petal length (cm)" with a threshold of 2.45.

If your future sample has a petal length <= 2.45cm it will be pushed into the left branch and therefore will be classified as class = setosa.
However, if the petal length is > 2.45cm, it will be pushed into the right branch and the next node (feature) is evaluated.

I hope I understood your question correctly.

Best regards,
Piotr

On 25.10.2016 08:41, Startup Hire wrote:
Hi all,

Thanks for the suggestion.

I have a related question on tree visualization.

I have 2 classes to predict: 0 and 1 (it comes up as a numeric field when I load the dataset).

I have given the class_names as "NotPresent" and "Ispresent", which I believe will map to 0 and 1. Is that correct?

How do I interpret the nodes and the value present in each node in the accompanying diagram?

Regards,
Sanant

On Mon, Oct 24, 2016 at 9:17 PM, Sebastian Raschka <se.raschka at gmail.com> wrote:
Hi, Greg,
if you provide the class_names argument, a 'class' label of the majority class will be added at the bottom of each node. For instance, if you have the Iris dataset, with class labels 0, 1, 2, you can provide the class_names as ['setosa', 'versicolor', 'virginica'], where 0 -> 'setosa', 1 -> 'versicolor', 2 -> 'virginica'.

Best,
Sebastian

> On Oct 24, 2016, at 10:18 AM, greg g <greg315 at hotmail.fr> wrote:
>
> Hi,
> I just begin with scikit-learn and would like to visualize a classification tree with class names displayed in the leaves as shown in the SCIKITLEARN.TREE documentation http://scikit-learn.org/stable/modules/tree.html where we find class='virginica' etc...
> I made a tree providing a 2D array X (n1 samples, n2 features) and 1D array Y (n1 corresponding classes) such that Y(i) is the class of the sample X(i, ...)
> After that I have correct predictions using predict()
> Then I use the function
>     export_graphviz(clf, out_file=dot_data,feature_names=FEATURES)
> with FEATURES being the array of my n2 features names in the same order as in X
> I obtain the tree .png but can't find a way to have the correct class names in the leaves...
> In export_graphviz() should I use the class_names optional parameter and how ?
> Thanks for any help
>
> Gregory, Toulouse FRANCE

From blrstartuphire at gmail.com Tue Oct 25 08:15:15 2016
From: blrstartuphire at gmail.com (Startup Hire)
Date: Tue, 25 Oct 2016 17:45:15 +0530
Subject: [scikit-learn] tree visualization with class names in leaves
In-Reply-To:
References:
Message-ID:

Hi Piotr,

Thanks for the reply.

I understand the thresholds at the current node.

I was referring to this:

Consider the node: Duration <= 0.5, having gini = 0.3386 and samples = 327510

What is meant by this: value = [216974.9673, 59743.3314]

Regards,
Sanant

On Tue, Oct 25, 2016 at 3:02 PM, Piotr Bialecki wrote:
> Hi Sanant,
>
> the values represent the thresholds at the current feature (node), which
> are used to classify the next sample.
>
> You can see an example here:
> http://scikit-learn.org/stable/modules/tree.html
>
> The first node uses the feature "petal length (cm)" with a threshold of
> 2.45.
>
> If your future sample has a petal length <= 2.45cm it will be pushed into
> the left branch and therefore will be classified as class = setosa.
> However, if the petal length is > 2.45cm, it will be pushed into the right
> branch and the next node (feature) is evaluated.
>
> I hope I understood your question correctly.
>
> Best regards,
> Piotr
>
> On 25.10.2016 08:41, Startup Hire wrote:
>
> Hi all,
>
> Thanks for the suggestion.
>
> I have a related question on tree visualization.
>
> I have 2 classes to predict: 0 and 1 (it comes up as a numeric field when
> I load the dataset).
>
> I have given the class_names as "NotPresent" and "Ispresent", which I
> believe will map to 0 and 1. Is that correct?
>
> How do I interpret the nodes and the value present in each node in the
> accompanying diagram?
>
> Regards,
> Sanant
>
> On Mon, Oct 24, 2016 at 9:17 PM, Sebastian Raschka <se.raschka at gmail.com> wrote:
>
>> Hi, Greg,
>> if you provide the class_names argument, a 'class' label of the
>> majority class will be added at the bottom of each node. For instance, if
>> you have the Iris dataset, with class labels 0, 1, 2, you can provide the
>> class_names as ['setosa', 'versicolor', 'virginica'], where 0 ->
>> 'setosa', 1 -> 'versicolor', 2 -> 'virginica'.
>>
>> Best,
>> Sebastian
>>
>> > On Oct 24, 2016, at 10:18 AM, greg g <greg315 at hotmail.fr> wrote:
>> >
>> > Hi,
>> > I just begin with scikit-learn and would like to visualize a classification tree
>> > with class names displayed in the leaves as shown in the SCIKITLEARN.TREE documentation http://scikit-learn.org/stable/modules/tree.html where we find class='virginica' etc...
>> > I made a tree providing a 2D array X (n1 samples, n2 features) and 1D array Y (n1 corresponding classes) such that Y(i) is the class of the sample X(i, ...)
>> > After that I have correct predictions using predict()
>> > Then I use the function
>> >     export_graphviz(clf, out_file=dot_data,feature_names=FEATURES)
>> > with FEATURES being the array of my n2 features names in the same order as in X
>> > I obtain the tree .png but can't find a way to have the correct class names in the leaves...
>> > In export_graphviz() should I use the class_names optional parameter and how ?
>> > Thanks for any help
>> >
>> > Gregory, Toulouse FRANCE

From mail at sebastianraschka.com Tue Oct 25 08:45:45 2016
From: mail at sebastianraschka.com (Sebastian Raschka)
Date: Tue, 25 Oct 2016 08:45:45 -0400
Subject: [scikit-learn] tree visualization with class names in leaves
In-Reply-To:
References:
Message-ID:

Hi, Gregory,

> I don't use the iris dataset. My classes are distributed in my Y array.

Yeah, I just used this as a simple example :).

> the nodes of the graphical tree seem to be filled with the predominant class

I think that's right, it gets the class name of the majority class at each node via "class_name = class_names[np.argmax(value)]" (https://github.com/scikit-learn/scikit-learn/blob/3a106fc792eb8e70e1fd078e351ba42487d3214d/sklearn/tree/export.py#L286)

> in a vector with the classes in alphabetical order ( the same order as in clf.classes_)

Yes, it should be in ascending, alphanumerical order. Not sure if this is still a general recommendation in sklearn 0.18, but I typically convert string class labels to integers before I feed them to a classifier (but it seems to work either way now):

    >>> from sklearn.preprocessing import LabelEncoder
    >>> le = LabelEncoder()
    >>> y = le.fit_transform(labels)
    >>> le.classes_
    array(['Setosa', 'Versicolor', 'Virginica'], dtype='<U10')
    >>> import numpy as np
    >>> np.bincount(y)
    array([50, 50, 50])

Best,
Sebastian

> On Oct 25, 2016, at 3:00 AM, greg g wrote:
>
> Hi Sebastian,
> Thanks for your answer.
> I don't use the iris dataset. My classes are distributed in my Y array.
> It seems that I can get the classes in alphabetical order with
>     clf.classes_
> where clf is my tree.
> And with
>     export_graphviz(clf, out_file=dot_data,feature_names=FEATURES,class_names=clf.classes_)
> the nodes of the graphical tree seem to be filled with the predominant class and the distribution of samples in a vector with the classes in alphabetical order (the same order as in clf.classes_).
> I have to confirm that with more classes.
>
> Regards
> Gregory
>
> From: scikit-learn on behalf of Sebastian Raschka
> Sent: Monday, 24 October 2016 17:47
> To: Scikit-learn user and developer mailing list
> Subject: Re: [scikit-learn] tree visualization with class names in leaves
>
> Hi, Greg,
> if you provide the class_names argument, a 'class' label of the majority class will be added at the bottom of each node.
> For instance, if you have the Iris dataset, with class labels 0, 1, 2, you can provide the class_names as ['setosa', 'versicolor', 'virginica'], where 0 -> 'setosa', 1 -> 'versicolor', 2 -> 'virginica'.
>
> Best,
> Sebastian
>
> > On Oct 24, 2016, at 10:18 AM, greg g wrote:
> >
> > Hi,
> > I just begin with scikit-learn and would like to visualize a classification tree with class names displayed in the leaves as shown in the SCIKITLEARN.TREE documentation http://scikit-learn.org/stable/modules/tree.html where we find class='virginica' etc...
> > I made a tree providing a 2D array X (n1 samples, n2 features) and 1D array Y (n1 corresponding classes) such that Y(i) is the class of the sample X(i, ...)
> > After that I have correct predictions using predict()
> > Then I use the function
> >     export_graphviz(clf, out_file=dot_data,feature_names=FEATURES)
> > with FEATURES being the array of my n2 features names in the same order as in X
> > I obtain the tree .png but can't find a way to have the correct class names in the leaves...
> > In export_graphviz() should I use the class_names optional parameter and how ?
> > Thanks for any help
> >
> > Gregory, Toulouse FRANCE

From yafc18 at gmail.com Tue Oct 25 22:42:40 2016
From: yafc18 at gmail.com (颜发才(Yan Facai))
Date: Wed, 26 Oct 2016 10:42:40 +0800
Subject: [scikit-learn] The implementation of gradient_boost.py:BinomialDeviance?
Message-ID:

Hi,
which paper or book is the foundation of the implementation of gradient_boost.py:BinomialDeviance?

I recently read the paper: Friedman: greedy function approximation - a gradient boosting machine.
I believe that L2_TreeBoost in the paper should be equivalent to BinomialDeviance in scikit-learn, but their implementations differ, for example:

+ negative_gradient:
  - in scikit: \tilde{y} = y - expit(pred.ravel()) = y - \frac{1}{1 + \exp(-F)}
  - in paper: \tilde{y} = \frac{2y}{1 + \exp(2yF)}

Can anyone help me?

Thanks.

From surangakas at gmail.com Wed Oct 26 11:26:20 2016
From: surangakas at gmail.com (Suranga Kasthurirathne)
Date: Wed, 26 Oct 2016 11:26:20 -0400
Subject: [scikit-learn] Calculating prediction probability per each predicted outcome
Message-ID:

Hi everyone,

I'm currently using Scikit learn to train and test multiple neural networks.

My issue - I'm breaking my dataset into 90/10, training on the 90%, and testing on the 10%.

For the 10% held-out data, I get outcomes as follows:

predicted = neural_network.predict(test_data)

Here, the predicted variable is basically either 1 or 0, which is what I'm feeding in as the outcome.

But how can I get the prediction probability for each predicted outcome? Back in the day when I used Weka, it produced a single prediction, followed by a prediction probability between 0 and 1 for each outcome.

--
Best Regards,
Suranga

From piotr.bialecki at hotmail.de Wed Oct 26 11:35:25 2016
From: piotr.bialecki at hotmail.de (Piotr Bialecki)
Date: Wed, 26 Oct 2016 15:35:25 +0000
Subject: [scikit-learn] Calculating prediction probability per each predicted outcome
In-Reply-To:
References:
Message-ID:

Hi Suranga,

if you are using the MLPClassifier class, it should have a predict_proba() method.
Try:
predicted = neural_network.predict_proba(test_data)

Best regards,
Piotr

On 26.10.2016 17:26, Suranga Kasthurirathne wrote:

Hi everyone,

I'm currently using Scikit learn to train and test multiple neural networks.

My issue - I'm breaking my dataset into 90/10, training on the 90%, and testing on the 10%.

For the 10% held-out data, I get outcomes as follows:

predicted = neural_network.predict(test_data)

Here, the predicted variable is basically either 1 or 0, which is what I'm feeding in as the outcome.

But how can I get the prediction probability for each predicted outcome? Back in the day when I used Weka, it produced a single prediction, followed by a prediction probability between 0 and 1 for each outcome.

--
Best Regards,
Suranga

From surangakas at gmail.com Fri Oct 28 08:13:31 2016
From: surangakas at gmail.com (Suranga Kasthurirathne)
Date: Fri, 28 Oct 2016 08:13:31 -0400
Subject: [scikit-learn] Calculating prediction probability per each predicted outcome
In-Reply-To:
References:
Message-ID:

Thanks Piotr, this was indeed the case. Works for me now :)

On Wed, Oct 26, 2016 at 11:26 AM, Suranga Kasthurirathne <surangakas at gmail.com> wrote:
>
> Hi everyone,
>
> I'm currently using Scikit learn to train and test multiple neural
> networks.
>
> My issue - I'm breaking my dataset into 90/10, training on the 90%, and
> testing on the 10%.
>
> For the 10% held-out data, I get outcomes as follows:
>
> predicted = neural_network.predict(test_data)
>
> Here, the predicted variable is basically either 1 or 0, which is what I'm
> feeding in as the outcome.
>
> But how can I get the prediction probability for each predicted outcome?
> Back in the day when I used Weka, it produced a single prediction, followed
> by a prediction probability between 0 and 1 for each outcome.
>
> --
> Best Regards,
> Suranga
>

--
Best Regards,
Suranga
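
For reference, a minimal sketch of the predict() versus predict_proba() difference discussed in this thread, assuming MLPClassifier and a synthetic binary dataset from make_classification in place of the real data (the dataset, the network size, and every variable name other than neural_network and predicted are illustrative assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Synthetic binary data stands in for the real dataset (an assumption).
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# 90/10 split, as in the question.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state=0)

neural_network = MLPClassifier(hidden_layer_sizes=(20,), max_iter=1000,
                               random_state=0)
neural_network.fit(X_train, y_train)

predicted = neural_network.predict(X_test)             # hard labels, 0 or 1
probabilities = neural_network.predict_proba(X_test)   # shape (n_samples, 2)

# Columns follow neural_network.classes_, so column 1 is P(class == 1).
for label, proba in zip(predicted[:5], probabilities[:5, 1]):
    print(label, round(proba, 3))
```

Each row of probabilities sums to 1, and the label returned by predict() is the class with the larger probability in that row.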
From ragvrv at gmail.com Sun Oct 30 09:17:56 2016
From: ragvrv at gmail.com (Raghav R V)
Date: Sun, 30 Oct 2016 14:17:56 +0100
Subject: [scikit-learn] Module Level Labels?
Message-ID:

Hi all,

Should we have module level labels?

"mod: tree"
"mod: model_selection"
"mod: linear_models"
"mod: ..."

I know it will blow up our label count, but I think it will help filter issues / PRs to review.

Sometimes I like to look into issues / PRs that concern my two fav. modules "model_selection" and "tree" and I have to resort to some complex searches to cover all possible key words. (Sometimes the OP does not use the right keywords in their issue / PR)

(From time to time I may keep popping some crazy suggestions, please feel free to shoot them down)

Have a good weekend!!

Raghav RV
https://github.com/raghavrv

From nelle.varoquaux at gmail.com Sun Oct 30 11:52:42 2016
From: nelle.varoquaux at gmail.com (Nelle Varoquaux)
Date: Sun, 30 Oct 2016 08:52:42 -0700
Subject: [scikit-learn] Module Level Labels?
In-Reply-To:
References:
Message-ID:

Hello,

I personally don't think it is useful and it clutters the UI with information. I am actually trying to reduce matplotlib's number of labels right now, as we have so many that they are useless.

Cheers,
N

On 30 October 2016 at 06:17, Raghav R V wrote:
> Hi all,
>
> Should we have module level labels?
>
> "mod: tree"
> "mod: model_selection"
> "mod: linear_models"
> "mod: ..."
>
> I know it will blow up our label count, but I think it will help filter
> issues / PRs to review.
>
> Sometimes I like to look into issues / PRs that concern my two fav. modules
> "model_selection" and "tree" and I have to resort to some complex searches
> to cover all possible key words.
> (Sometimes the OP does not use the right
> keywords in their issue / PR)
>
> (From time to time I may keep popping some crazy suggestions, please feel
> free to shoot them down)
>
> Have a good weekend!!
>
>
> Raghav RV
> https://github.com/raghavrv

From surangakas at gmail.com Sun Oct 30 15:24:12 2016
From: surangakas at gmail.com (Suranga Kasthurirathne)
Date: Sun, 30 Oct 2016 12:24:12 -0700
Subject: [scikit-learn] Problem using boxplots to compare significance of model performance
Message-ID:

Hi folks!

I'm using scikit-learn to build two neural networks using 10% holdout, and compare their performance using precision. To compare statistical significance in the variance of precision, I'm using scikit's boxplots.

My problem is twofold -

1) The standard deviation in the precision of the two models (obtained using precision.std()) is always 0.0. I'm assuming that's a problem.
2) My boxplot is meant to display bars for the two models, but always displays only the first model (nn01)

My outcomes for this dataset are binary (0 or 1). Since the models assume average=binary by default, is that a problem?

For those who'd like to look, my source code can be seen at http://pastebin.com/yvE2T1Sw

The code produces the following plot - which is of course only ONE of the bars that I need :(

--
Best Regards,
Suranga
-------------- next part --------------
A non-text attachment was scrubbed...
Name: Screen Shot 2016-10-30 at 12.17.22 PM.png
Type: image/png
Size: 45270 bytes
Desc: not available
URL:

From se.raschka at gmail.com Sun Oct 30 15:56:21 2016
From: se.raschka at gmail.com (Sebastian Raschka)
Date: Sun, 30 Oct 2016 15:56:21 -0400
Subject: [scikit-learn] Problem using boxplots to compare significance of model performance
In-Reply-To:
References:
Message-ID:

Hi, Suranga,

> 1) The standard deviation in the precision of the two models (obtained using precision.std()) is always 0.0. I'm assuming that's a problem.

That's weird. Are you sure that "precision" has more than one value? E.g.,

    >>> np.array([0.89]).std()
    0.0

> 2) My boxplot is meant to display bars for the two models, but always displays only the first model (nn01)

Here too, your input array or list for the boxplot function may not be formatted correctly. What you want is

    two_models = [ 1Darray_of_model1_results, 1Darray_of_model2_results ]

    plt.boxplot(two_models,
                notch=False,  # box instead of notch shape
                sym='rs',     # red squares for outliers
                vert=True)    # vertical box alignment

PS: If you are comparing specifically 2 neural network models, have you considered McNemar's test? E.g., see https://github.com/rasbt/mlxtend/blob/master/docs/sources/user_guide/evaluate/mcnemar.ipynb

Best
Sebastian

> On Oct 30, 2016, at 3:24 PM, Suranga Kasthurirathne wrote:
>
>
> Hi folks!
>
> I'm using scikit-learn to build two neural networks using 10% holdout, and compare their performance using precision. To compare statistical significance in the variance of precision, I'm using scikit's boxplots.
>
> My problem is twofold -
>
> 1) The standard deviation in the precision of the two models (obtained using precision.std()) is always 0.0. I'm assuming that's a problem.
> 2) My boxplot is meant to display bars for the two models, but always displays only the first model (nn01)
>
> My outcomes for this dataset are binary (0 or 1). Since the models assume average=binary by default, is that a problem?
>
> For those who'd like to look, my source code can be seen at http://pastebin.com/yvE2T1Sw
>
> The code produces the following plot - which is of course only ONE of the bars that I need :(
>
> --
> Best Regards,
> Suranga

From surangakas at gmail.com Sun Oct 30 16:43:13 2016
From: surangakas at gmail.com (Suranga Kasthurirathne)
Date: Sun, 30 Oct 2016 13:43:13 -0700
Subject: [scikit-learn] Problem using boxplots to compare significance of model performance
Message-ID:

Hi Sebastian!

Thank you, you might be onto something here ;)

So, I may have to go over 2 models, so McNemar's may not be an option :(

In regard to your second comment, in building my boxplots, this is how I input results.

plt.boxplot(results)

So what does "results" look like?

[0.85433808345719897, 0.8976733724549345]

These are the two precision values calculated for each neural network. Exactly what should 1Darray_of_model1_results look like? Is it one value per model or....

--
Best Regards,
Suranga

From se.raschka at gmail.com Sun Oct 30 17:38:18 2016
From: se.raschka at gmail.com (Sebastian Raschka)
Date: Sun, 30 Oct 2016 17:38:18 -0400
Subject: [scikit-learn] Problem using boxplots to compare significance of model performance
In-Reply-To:
References:
Message-ID: <818ED03F-6F00-4F2C-9FBB-1B79E3E2ED34@gmail.com>

Hi, Suranga

> So, I may have to go over 2 models, so McNemar's may not be an option :(

Sure, but there are many other hypothesis tests; it was just a suggestion since I thought you just wanted to compare 2 models :)

> plt.boxplot(results)
> So what does "results" look like?
>
> [0.85433808345719897, 0.8976733724549345]

You can't do a boxplot based on 1 single value.

> These are the two precision values calculated for each neural network.
> Exactly what should 1Darray_of_model1_results look like? Is it one value per model or....

This should work:

    model_1 = [0.85,  # experiment 1
               0.84]  # experiment 2
    model_2 = [0.84,  # experiment 1
               0.83]  # experiment 2

    plt.boxplot([model_1, model_2])

However, a boxplot based on only 2 values doesn't make sense imho; you could just plot the range.

Best,
Sebastian

> On Oct 30, 2016, at 4:43 PM, Suranga Kasthurirathne wrote:
>
>
> Hi Sebastian!
>
> Thank you, you might be onto something here ;)
>
> So, I may have to go over 2 models, so McNemar's may not be an option :(
>
> In regard to your second comment, in building my boxplots, this is how I input results.
>
> plt.boxplot(results)
> So what does "results" look like?
>
> [0.85433808345719897, 0.8976733724549345]
>
> These are the two precision values calculated for each neural network. Exactly what should 1Darray_of_model1_results look like? Is it one value per model or....
>
>
> --
> Best Regards,
> Suranga

From yafc18 at gmail.com Sun Oct 30 20:02:34 2016
From: yafc18 at gmail.com (颜发才(Yan Facai))
Date: Mon, 31 Oct 2016 08:02:34 +0800
Subject: [scikit-learn] The implementation of gradient_boost.py:BinomialDeviance?
In-Reply-To:
References:
Message-ID:

Can anyone help me?
Thanks.

On Wed, Oct 26, 2016 at 10:42 AM, 颜发才(Yan Facai) wrote:

> Hi,
> which paper or book is the foundation of the implementation of
> gradient_boost.py:BinomialDeviance?
>
> I recently read the paper: Friedman: greedy function approximation - a
> gradient boosting machine.
> I believe that L2_TreeBoost in the paper should
> be equivalent to BinomialDeviance in scikit-learn, but their
> implementations differ, for example:
>
> + negative_gradient:
>     - in scikit: \tilde{y} = y - expit(pred.ravel())
>                            = y - \frac{1}{1 + \exp(-F)}
>     - in paper:  \tilde{y} = \frac{2 y}{1 + \exp(2 y F)}
>
> Can anyone help me?
> Thanks.

From t3kcit at gmail.com Mon Oct 31 11:32:16 2016
From: t3kcit at gmail.com (Andreas Mueller)
Date: Mon, 31 Oct 2016 11:32:16 -0400
Subject: [scikit-learn] Module Level Labels?
In-Reply-To:
References:
Message-ID:

I think it would be more helpful to lower the issue count by solving issues and closing non-helpful ones ;)

On 10/30/2016 11:52 AM, Nelle Varoquaux wrote:
> Hello,
>
> I personally don't think it is useful, and it clutters the UI with information.
> I am actually trying to reduce matplotlib's number of labels right
> now, as we have so many that they are useless.
>
> Cheers,
> N
>
> On 30 October 2016 at 06:17, Raghav R V wrote:
>> Hi all,
>>
>> Should we have module level labels?
>>
>> "mod: tree"
>> "mod: model_selection"
>> "mod: linear_models"
>> "mod: ..."
>>
>> I know it will blow up our label count, but I think it will help filter
>> issues / PRs to review.
>>
>> Sometimes I like to look into issues / PRs that concern my two fav. modules
>> "model_selection" and "tree" and I have to resort to some complex searches
>> to cover all possible key words. (Sometimes the OP does not use the right
>> keywords in their issue / PR)
>>
>> (From time to time I may keep popping some crazy suggestions, please feel
>> free to shoot them down)
>>
>> Have a good weekend!!
>>
>>
>> Raghav RV
>> https://github.com/raghavrv

From ragvrv at gmail.com Mon Oct 31 12:04:04 2016
From: ragvrv at gmail.com (Raghav R V)
Date: Mon, 31 Oct 2016 17:04:04 +0100
Subject: [scikit-learn] Module Level Labels?
In-Reply-To:
References:
Message-ID:

Okay! Thanks for the replies Nelle and Andy!

On Mon, Oct 31, 2016 at 4:32 PM, Andreas Mueller wrote:

> I think it would be more helpful to lower the issue count by solving
> issues and closing non-helpful ones ;)
>
>
> On 10/30/2016 11:52 AM, Nelle Varoquaux wrote:
>
>> Hello,
>>
>> I personally don't think it is useful, and it clutters the UI with
>> information.
>> I am actually trying to reduce matplotlib's number of labels right
>> now, as we have so many that they are useless.
>>
>> Cheers,
>> N
>>
>> On 30 October 2016 at 06:17, Raghav R V wrote:
>>
>>> Hi all,
>>>
>>> Should we have module level labels?
>>>
>>> "mod: tree"
>>> "mod: model_selection"
>>> "mod: linear_models"
>>> "mod: ..."
>>>
>>> I know it will blow up our label count, but I think it will help filter
>>> issues / PRs to review.
>>>
>>> Sometimes I like to look into issues / PRs that concern my two fav.
>>> modules
>>> "model_selection" and "tree" and I have to resort to some complex
>>> searches
>>> to cover all possible key words. (Sometimes the OP does not use the right
>>> keywords in their issue / PR)
>>>
>>> (From time to time I may keep popping some crazy suggestions, please feel
>>> free to shoot them down)
>>>
>>> Have a good weekend!!
>>>
>>>
>>> Raghav RV
>>> https://github.com/raghavrv

--
Raghav RV
https://github.com/raghavrv

From sumeet.k.sandhu at gmail.com Mon Oct 31 16:28:43 2016
From: sumeet.k.sandhu at gmail.com (Sumeet Sandhu)
Date: Mon, 31 Oct 2016 13:28:43 -0700
Subject: [scikit-learn] creating a custom scoring function for cross-validation of classification
Message-ID:

Hi,

I've been staring at various doc pages for a while to create a custom scorer that uses predict_proba output of a multi-class SGDClassifier:
http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html#sklearn.model_selection.cross_val_score
http://scikit-learn.org/stable/modules/model_evaluation.html#scoring-parameter
http://scikit-learn.org/stable/modules/generated/sklearn.metrics.make_scorer.html#sklearn.metrics.make_scorer

I got the impression I could customize the "scoring" parameter in cross_val_score directly, but that didn't work.
Then I tried customizing the "score_func" parameter in make_scorer, but that didn't work either.
Both errors are ValueErrors:

Traceback (most recent call last):
  File "", line 3, in
    accuracy = mean(cross_val_score(LRclassifier, trainPatentVecs, trainLabelVecs, cv=10, scoring = 'topNscorer'))
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/sklearn/cross_validation.py", line 1425, in cross_val_score
    scorer = check_scoring(estimator, scoring=scoring)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/sklearn/metrics/scorer.py", line 238, in check_scoring
    return get_scorer(scoring)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/sklearn/metrics/scorer.py", line 197, in get_scorer
    % (scoring, sorted(SCORERS.keys())))
ValueError: 'topNscorer' is not a valid scoring value. Valid options are ['accuracy', 'adjusted_rand_score', 'average_precision', 'f1', 'f1_macro', 'f1_micro', 'f1_samples', 'f1_weighted', 'log_loss', 'mean_absolute_error', 'mean_squared_error', 'median_absolute_error', 'precision', 'precision_macro', 'precision_micro', 'precision_samples', 'precision_weighted', 'r2', 'recall', 'recall_macro', 'recall_micro', 'recall_samples', 'recall_weighted', 'roc_auc']

At a high level, I want to find out if the true label was found in the top N multi-class labels coming out of an SGD classifier. Built-in scores like "accuracy" only look at N=1.
Here is the code using make_scorer:

    LRclassifier = SGDClassifier(loss='log')
    topNscorer = make_scorer(topNscoring, greater_is_better=True, needs_proba=True)
    accuracyN = mean(cross_val_score(LRclassifier, Data, Labels, scoring = 'topNscorer'))

Here is the code for the custom scoring function:

    def topNscoring(y, yp):
        ## Inputs y = true label per sample, yp = predict_proba probabilities of all labels per sample
        N = 5
        foundN = []
        for ii in xrange(0,shape(yp)[0]):
            indN = [ w[0] for w in sorted(enumerate(list(yp[ii,:])),key=lambda w:w[1],reverse=True)[0:N] ]
            if y[ii] in indN:
                foundN.append(1)
            else:
                foundN.append(0)
        return mean(foundN)

Any help will be greatly appreciated.

best regards,
Sumeet
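A note on the question above: the traceback shows that a string passed as "scoring" is looked up among the built-in scorer names, which is why 'topNscorer' in quotes is rejected. cross_val_score also accepts a callable with the signature scorer(estimator, X, y), so one option is to pass the function object itself. What follows is only a minimal sketch under that assumption, not a definitive implementation. The names top_n_accuracy and top_n_scorer are hypothetical, and the sketch assumes a fitted classifier that exposes predict_proba and classes_. It maps probability columns back to label values through classes_, because a column position is not necessarily equal to the label it stands for.

```python
def top_n_accuracy(y_true, proba, classes, n=5):
    """Fraction of samples whose true label is among the n most probable classes.

    y_true  : iterable of true labels, one per sample
    proba   : per-sample rows of class probabilities (predict_proba output)
    classes : label value for each probability column (estimator.classes_)
    """
    hits = 0
    total = 0
    for label, row in zip(y_true, proba):
        # Column indices of the n largest probabilities in this row.
        top = sorted(range(len(row)), key=lambda j: row[j], reverse=True)[:n]
        # Compare against label values, not column indices.
        if label in [classes[j] for j in top]:
            hits += 1
        total += 1
    return float(hits) / total if total else 0.0


def top_n_scorer(estimator, X, y, n=5):
    """Hypothetical scorer with the (estimator, X, y) call signature."""
    return top_n_accuracy(y, estimator.predict_proba(X), estimator.classes_, n=n)
```

With the names from the question, this would be used as cross_val_score(LRclassifier, Data, Labels, scoring=top_n_scorer), i.e. with the callable rather than a quoted name. A scorer built with make_scorer is passed the same way, as the object it returns.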