From shiduan at ucdavis.edu Mon Jan 1 17:56:28 2018
From: shiduan at ucdavis.edu (Shiheng Duan)
Date: Mon, 1 Jan 2018 14:56:28 -0800
Subject: [scikit-learn] clustering on big dataset

Hi all,

I wonder if there is any method to do exact clustering (hierarchical clustering) on a huge dataset where it is impossible to use a distance matrix. I am considering a KD-tree, but it needs to be rebuilt every time, which consumes a lot of time.
Any ideas?
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From olivier.grisel at ensta.org Tue Jan 2 09:02:24 2018
From: olivier.grisel at ensta.org (Olivier Grisel)
Date: Tue, 2 Jan 2018 15:02:24 +0100
Subject: [scikit-learn] clustering on big dataset

Have you had a look at BIRCH?

http://scikit-learn.org/stable/modules/clustering.html#birch

--
Olivier

From maciek at wojcikowski.pl Wed Jan 3 08:33:03 2018
From: maciek at wojcikowski.pl (Maciek Wójcikowski)
Date: Wed, 3 Jan 2018 14:33:03 +0100
Subject: [scikit-learn] MLPClassifier as a feature selector
In-Reply-To: <460c5520-3226-4aaf-bcbd-343d1e4a7e0e@normalesup.org>
References: <460c5520-3226-4aaf-bcbd-343d1e4a7e0e@normalesup.org>

I agree with Gael on this one and am happy to help with the PR if you need any assistance.

Best,
Maciek

----
Pozdrawiam, | Best regards,
Maciek Wójcikowski
maciek at wojcikowski.pl

2017-12-29 18:14 GMT+01:00 Gael Varoquaux :

> I think that a transform method would be good. We would have to add a
> parameter to the constructor to specify which layer is used for the
> transform. It should default to "-1", in my opinion.
>
> Cheers,
>
> Gaël
>
> Sent from my phone. Please forgive typos and briefness.
> On Dec 29, 2017, at 17:48, "Javier López" wrote:
>
>> Hi Thomas,
>>
>> it is possible to obtain the activation values of any hidden layer, but
>> the procedure is not completely straightforward. If you look at the code
>> of the `_predict` method of MLPs you can see the following:
>>
>> ```python
>> # Excerpt of a method; it relies on `import numpy as np` and
>> # `from sklearn.utils import check_array` at module level.
>> def _predict(self, X):
>>     """Predict using the trained model
>>
>>     Parameters
>>     ----------
>>     X : {array-like, sparse matrix}, shape (n_samples, n_features)
>>         The input data.
>>
>>     Returns
>>     -------
>>     y_pred : array-like, shape (n_samples,) or (n_samples, n_outputs)
>>         The decision function of the samples for each class in the
>>         model.
>>     """
>>     X = check_array(X, accept_sparse=['csr', 'csc', 'coo'])
>>
>>     # Make sure self.hidden_layer_sizes is a list
>>     hidden_layer_sizes = self.hidden_layer_sizes
>>     if not hasattr(hidden_layer_sizes, "__iter__"):
>>         hidden_layer_sizes = [hidden_layer_sizes]
>>     hidden_layer_sizes = list(hidden_layer_sizes)
>>
>>     layer_units = [X.shape[1]] + hidden_layer_sizes + \
>>         [self.n_outputs_]
>>
>>     # Initialize layers
>>     activations = [X]
>>
>>     for i in range(self.n_layers_ - 1):
>>         activations.append(np.empty((X.shape[0],
>>                                      layer_units[i + 1])))
>>     # forward propagate
>>     self._forward_pass(activations)
>>     y_pred = activations[-1]
>>
>>     return y_pred
>> ```
>>
>> The line `y_pred = activations[-1]` is responsible for extracting the
>> values for the last layer, but the `activations` variable contains the
>> values for all the neurons.
>>
>> You can make this function into your own external method (replacing the
>> `self` argument with a proper parameter) and add an extra argument which
>> specifies the layer(s) that you want.
>> I have done this myself in order to make an AutoEncoderNetwork out of the
>> MLP implementation.
>>
>> This makes me wonder, would it be worth adding this to sklearn?
>> A very simple way would be to refactor the `_predict` method, with the
>> additional layer argument, into a new method `_predict_layer`; then the
>> `_predict` method can simply call `_predict_layer(..., layer=-1)`, and we
>> can add a new method (perhaps a `transform`?) that returns the (raveled)
>> values for an arbitrary subset of the layers.
>>
>> I'd be happy to submit a PR if you guys think it would be interesting for
>> the project.
>>
>> Javier
>>
>> On Thu, Dec 7, 2017 at 12:51 AM Thomas Evangelidis wrote:
>>
>>> Greetings,
>>>
>>> I want to train an MLPClassifier with one hidden layer and use it as a
>>> feature selector for an MLPRegressor.
>>> Is it possible to get the values of the neurons from the last hidden
>>> layer of the MLPClassifier to pass them as input to the MLPRegressor?
>>>
>>> If it is not possible with scikit-learn, is anyone aware of any
>>> scikit-compatible NN library that offers this functionality? For example
>>> this one:
>>>
>>> http://scikit-neuralnetwork.readthedocs.io/en/latest/index.html
>>>
>>> I wouldn't like to do this in Tensorflow because the MLP there is much
>>> slower than scikit-learn's implementation.
>>>
>> ------------------------------
>>
>> scikit-learn mailing list
>> scikit-learn at python.org
>> https://mail.python.org/mailman/listinfo/scikit-learn

From jmschreiber91 at gmail.com Wed Jan 3 12:39:33 2018
From: jmschreiber91 at gmail.com (Jacob Schreiber)
Date: Wed, 3 Jan 2018 09:39:33 -0800
Subject: [scikit-learn] pomegranate v0.9.0 released: probabilistic modeling for Python

Howdy all!

I'm pleased to announce the release of pomegranate v0.9.0.
The focus of this release is on missing value support across all model fitting / structure learning / inference methods and models. This enables you to do everything from fitting a multivariate Gaussian distribution to an incomplete data set (using a GPU if desired!) to learning the structure of a Bayesian network on an incomplete data set, to running Viterbi decoding using a hidden Markov model on a sequence with some missing values. Read more about it here: http://bit.ly/2CyrXtX

Thanks!
Jacob

From shiduan at ucdavis.edu Wed Jan 3 19:04:18 2018
From: shiduan at ucdavis.edu (Shiheng Duan)
Date: Wed, 3 Jan 2018 16:04:18 -0800
Subject: [scikit-learn] clustering on big dataset

Yes, it is an efficient method, but we still need to specify the number of clusters or the threshold. Is there another way to run hierarchical clustering on a big dataset? The main problem is the distance matrix.
Thanks.

From manuel.castejon at gmail.com Wed Jan 3 19:34:58 2018
From: manuel.castejon at gmail.com (Manuel Castejón Limas)
Date: Thu, 4 Jan 2018 01:34:58 +0100
Subject: [scikit-learn] Any plans on generalizing Pipeline and transformers?

I've read about Dask and it is a tool I want to have in my belt, especially for using the SGE connection in order to run GridSearchCV on the supercomputer center I have access to. Should it work as promised, it will be one of my favs.
As for my toy example, I keep more limited goals with this graph: I am not currently interested in parallelizing each step, as I guess that parallelizing each graph fit through GridSearchCV will be closer to what I need.

I keep working on a proof of concept. You can have a look at:

https://github.com/mcasl/PAELLA/blob/master/pipeGraph.py

along with a few unit tests:

https://github.com/mcasl/PAELLA/blob/master/tests/test_pipeGraph.py

As of today, I have an iterable graph of steps that can be fitted/run depending on their role (some can be disabled during run while active during fit, or vice versa).

I still have to play a bit with injecting different parameters to make it compatible with GridSearchCV, and learn a bit about the memory options in order to cache results.

Any comments highly appreciated, truly!
Manolo

2017-12-30 15:34 GMT+01:00 Frédéric Bastien :

> This start to look as the dask project. Do you know it?

From sky188133882 at 163.com Thu Jan 4 01:49:46 2018
From: sky188133882 at 163.com (Yang Li)
Date: Thu, 4 Jan 2018 14:49:46 +0800 (CST)
Subject: [scikit-learn] A necessary feature for Decision trees
Message-ID: <41994e1a.8536.160bfeece76.Coremail.sky188133882@163.com>

Hi,

I'm a graduate student using sklearn for some data work. While handling data with the decision tree library, I found an inconvenience:

Neither the classification tree nor the regression tree supports categorical features. That means the decision tree models can only accept continuous features.

For example, a categorical feature such as an app name (google, facebook) can't be fed into the model, because it can't be properly transformed into a continuous value. And there is no corresponding algorithm for splitting discrete features in the decision tree library.

However, the CART algorithm itself does consider the use of categorical features. So I have made some modifications to the decision tree library based on CART and applied the new model to my own work. It turns out that support for categorical features indeed improves the performance, which I think makes it very necessary for decision trees.

I'm very willing to contribute this to the sklearn community, but I'm new to this community and not so familiar with the procedure.

Could you give some suggestions or comments on this new feature? Or has anyone already worked on this feature?

Thank you so much.
Best wishes!

--
Yang Li +86 188 1821 2371
Shanghai Jiao Tong University
School of Electronic Information and Electrical Engineering F1203026
800 Dongchuan Road, Minhang District, Shanghai 200240

From jbbrown at kuhp.kyoto-u.ac.jp Thu Jan 4 02:30:34 2018
From: jbbrown at kuhp.kyoto-u.ac.jp (Brown J.B.)
Date: Thu, 4 Jan 2018 16:30:34 +0900
Subject: [scikit-learn] A necessary feature for Decision trees
In-Reply-To: <41994e1a.8536.160bfeece76.Coremail.sky188133882@163.com>
References: <41994e1a.8536.160bfeece76.Coremail.sky188133882@163.com>

Dear Yang Li,

> Neither the classification tree nor the regression tree supports
> categorical features. That means the decision tree models can only
> accept continuous features.

Consider either manually encoding your categories in bitstrings (e.g., "Facebook" = 001, "Twitter" = 010, "Google" = 100), or using OneHotEncoder to do the same thing for you automatically.

Cheers,
J.B.
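
The encoding suggested above can be sketched as follows. This is only an illustration, not part of the thread: it assumes a recent scikit-learn release in which OneHotEncoder accepts string categories directly, and the app names and labels are made-up data.

```python
# Sketch only: assumes a recent scikit-learn where OneHotEncoder accepts strings.
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Hypothetical data: one categorical column (app name) and a binary label.
apps = [["google"], ["facebook"], ["twitter"], ["google"]]
labels = [1, 0, 0, 1]

# Each category becomes its own 0/1 column; columns follow sorted category order.
encoder = OneHotEncoder(handle_unknown="ignore")
X = encoder.fit_transform(apps).toarray()

# The tree now thresholds the binary indicator columns.
tree = DecisionTreeClassifier(random_state=0).fit(X, labels)
pred = tree.predict(encoder.transform([["google"], ["twitter"]]).toarray())
```

With this encoding every split is a "this category vs. all others" test on one indicator column, which is why it is often used as a workaround for trees without native categorical support.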
From sky188133882 at 163.com Thu Jan 4 03:06:22 2018
From: sky188133882 at 163.com (Yang Li)
Date: Thu, 4 Jan 2018 16:06:22 +0800 (CST)
Subject: [scikit-learn] A necessary feature for Decision trees
In-Reply-To:
References: <41994e1a.8536.160bfeece76.Coremail.sky188133882@163.com>
Message-ID: <3b9a51c4.9ec9.160c034edc0.Coremail.sky188133882@163.com>

Dear J.B.,

Thanks for your advice!

Yeah, I have considered using a bitstring or a sequence number, but the problem is the algorithm, not the representation of the categorical data.
Take the regression tree as an example: the algorithm in sklearn finds a split value for the feature, and picks the best split by minimizing the impurity of the child nodes.
However, finding a split value for a categorical feature is not that meaningful even if you represent it as a continuous value, and the split result partly depends on how you order the values of the categorical feature, which is not very convincing.
Instead, in the CART algorithm, you should separate each category in the feature from the others and compute the impurity of the two sets, then choose the separation with the minimal impurity.
Obviously, this separation cannot be carried out by the current algorithm, which simply applies the split method for continuous values.

One more possible shortcoming is that a categorical feature can't be properly visualized: when drawing the tree graph, it's hard to get information from a categorical feature node if you just split it on a value.

Thank you for your time!
Best wishes.

--
Yang Li +86 188 1821 2371
Shanghai Jiao Tong University
School of Electronic Information and Electrical Engineering F1203026
800 Dongchuan Road, Minhang District, Shanghai 200240

From joel.nothman at gmail.com Thu Jan 4 06:55:49 2018
From: joel.nothman at gmail.com (Joel Nothman)
Date: Thu, 4 Jan 2018 22:55:49 +1100
Subject: [scikit-learn] clustering on big dataset

Can you use nearest neighbors with a KD tree to build a distance matrix that is sparse, in that distances to all but the nearest neighbors of a point are (near-)infinite? Yes, this again has an additional parameter (neighborhood size), just as BIRCH has its threshold. I suspect you will not be able to improve on having another, approximating, parameter. You do not need to set n_clusters to a fixed value for BIRCH. You only need to provide another clusterer, which has its own parameters, although you should be able to experiment with different "global clusterers".

From julio at esbet.es Thu Jan 4 10:02:48 2018
From: julio at esbet.es (Julio Antonio Soto de Vicente)
Date: Thu, 4 Jan 2018 16:02:48 +0100
Subject: [scikit-learn] A necessary feature for Decision trees
In-Reply-To: <3b9a51c4.9ec9.160c034edc0.Coremail.sky188133882@163.com>
References: <41994e1a.8536.160bfeece76.Coremail.sky188133882@163.com> <3b9a51c4.9ec9.160c034edc0.Coremail.sky188133882@163.com>
Message-ID: <1C0DFDCA-503B-4C96-9358-25FCC4970457@esbet.es>

Hi Yang Li,

I have to agree with you. Bitset and/or one hot encoding are just hacks which should not be necessary for decision tree learners.

There is some WIP on an implementation for natural handling of categorical features in trees: please take a look at https://github.com/scikit-learn/scikit-learn/pull/4899

Cheers!

--
Julio
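
The "separate each category from the others and compute the impurity of the two sets" rule described in this thread can be sketched in plain Python. This is a minimal illustration, not scikit-learn's implementation and not the code from the pull request: the function names are hypothetical, Gini impurity is an assumed choice for a classification target (a regression tree would use variance instead), and only one-vs-rest splits are tried, as in the description above.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels (0.0 for an empty or pure list)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def best_one_vs_rest_split(categories, labels):
    """Return (category, weighted_impurity) for the best 'category vs. rest' split."""
    n = len(labels)
    best = None
    for cat in sorted(set(categories)):
        left = [y for c, y in zip(categories, labels) if c == cat]
        right = [y for c, y in zip(categories, labels) if c != cat]
        if not right:  # only one category present: nothing to split
            continue
        # Impurity of the two child sets, weighted by their sizes.
        impurity = (len(left) * gini(left) + len(right) * gini(right)) / n
        if best is None or impurity < best[1]:
            best = (cat, impurity)
    return best

# Hypothetical data: app name as the categorical feature, binary target.
apps = ["google", "google", "facebook", "facebook", "twitter", "twitter"]
clicked = [1, 1, 0, 0, 0, 0]
split = best_one_vs_rest_split(apps, clicked)
```

Unlike a threshold on an integer code, the chosen split here does not depend on the order in which the categories happen to be numbered.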
From t3kcit at gmail.com Thu Jan 4 13:45:17 2018
From: t3kcit at gmail.com (Andreas Mueller)
Date: Thu, 4 Jan 2018 13:45:17 -0500
Subject: [scikit-learn] A necessary feature for Decision trees
In-Reply-To: <1C0DFDCA-503B-4C96-9358-25FCC4970457@esbet.es>
References: <41994e1a.8536.160bfeece76.Coremail.sky188133882@163.com> <3b9a51c4.9ec9.160c034edc0.Coremail.sky188133882@163.com> <1C0DFDCA-503B-4C96-9358-25FCC4970457@esbet.es>
Message-ID: <296673fd-34c0-8723-048a-8e40561d9e7f@gmail.com>

Your contribution would be very welcome, I think the current work has stalled.
From stefanv at berkeley.edu Thu Jan 4 16:12:30 2018
From: stefanv at berkeley.edu (Stefan van der Walt)
Date: Thu, 04 Jan 2018 13:12:30 -0800
Subject: [scikit-learn] Position at BIDS (UC Berkeley) to work on NumPy
Message-ID: <1515100350.2959140.1224581408.1FC64DEA@webmail.messagingengine.com>

Hi everyone,

The Berkeley Institute for Data Science (BIDS) is hiring scientific Python Developers to contribute to NumPy. You can read more about the new positions here:

https://bids.berkeley.edu/news/bids-receives-sloan-foundation-grant-contribute-numpy-development

If you enjoy collaborative work as well as the technical challenges posed by numerical computing, this is an excellent opportunity to play a fundamental role in the development of one of the most impactful libraries in the entire Python ecosystem.

Best regards
Stéfan

Job link: https://jobsprod.is.berkeley.edu/psc/jobsprod/EMPLOYEE/HRMS/c/HRS_HRAM.HRS_CE.GBL?Page=HRS_CE_JOB_DTL&Action=A&JobOpeningId=24142&SiteId=1&PostingSeq=1

From shiduan at ucdavis.edu Thu Jan 4 20:51:58 2018
From: shiduan at ucdavis.edu (Shiheng Duan)
Date: Thu, 4 Jan 2018 17:51:58 -0800
Subject: [scikit-learn] clustering on big dataset

Thanks, Joel,

I am working on a KD-tree to find the nearest neighbors. Basically, I find the nearest neighbor of each point and then merge a pair of points if each is the nearest neighbor of the other. The problem is that after each iteration we have a new set of points, to which the new clusters have been added, so the tree needs to be updated. Since I didn't find any dynamic way to update the tree, I just rebuild it after each iteration, which costs a lot of time. Any idea about it?

Actually, it takes around 16 mins to build the tree in the first iteration, which is not slow I think. But the program as a whole still runs slowly. I have a dataset of 12*872505 (features, samples), and it takes several days to run. Is there any way to speed up the nearest-neighbor query? I suspect the query may be too slow.

Thanks for your time.
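
One pass of the mutual-nearest-neighbor merging described in this message might look like the sketch below. It is an illustration under stated assumptions, not the poster's code: the function name is hypothetical, merged pairs are replaced by their midpoint (the message does not say how merged points are represented), and the input is assumed to contain no duplicate rows. As in the message, the tree would be rebuilt on the returned points for the next pass.

```python
import numpy as np
from sklearn.neighbors import KDTree

def merge_mutual_neighbors(points):
    """One merge pass: replace every mutual nearest-neighbor pair by its midpoint."""
    tree = KDTree(points)
    # k=2 because each point's closest match is itself (assumes no duplicate rows).
    _, idx = tree.query(points, k=2)
    nearest = idx[:, 1]

    merged = []
    used = np.zeros(len(points), dtype=bool)
    for i, j in enumerate(nearest):
        # Mutual pair: i's nearest neighbor is j and j's is i; count each pair once.
        if nearest[j] == i and i < j:
            merged.append((points[i] + points[j]) / 2.0)
            used[i] = used[j] = True

    rest = points[~used]  # points that were not part of any mutual pair
    return np.vstack(merged + [rest]) if merged else rest

# Hypothetical toy data: two tight pairs and one isolated point.
pts = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0], [10.0, 10.0]])
out = merge_mutual_neighbors(pts)
```

Querying all points in a single `tree.query` call, rather than one point at a time in a Python loop, is usually the cheaper way to use the tree; whether that is the bottleneck in the run described above is not known from the thread.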
From joel.nothman at gmail.com Thu Jan 4 21:49:46 2018
From: joel.nothman at gmail.com (Joel Nothman)
Date: Fri, 5 Jan 2018 13:49:46 +1100
Subject: [scikit-learn] clustering on big dataset

Yes, use an approximate nearest neighbors approach. None is included in scikit-learn, but there are numerous implementations with Python interfaces.

From sumeet.k.sandhu at gmail.com Sun Jan 7 14:35:06 2018
From: sumeet.k.sandhu at gmail.com (Sumeet Sandhu)
Date: Sun, 7 Jan 2018 11:35:06 -0800
Subject: [scikit-learn] sklearn.model_selection.GridSearchCV - unable to use n_jobs>1 on MacOS Sierra python 2.7

Hi,

I was able to run this with n_jobs=-1, and the activity monitor does show all 8 CPUs engaged, but the jobs start to die out one by one. I tried with n_jobs=2, same story.
The only option that works is n_jobs=1.
I played around with 'pre_dispatch' a bit - unclear what that does.

    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import GridSearchCV

    GRID = GridSearchCV(LogisticRegression(), param_grid, scoring=None,
                        fit_params=None, n_jobs=1, iid=True, refit=True,
                        cv=10, verbose=0, error_score=0,
                        return_train_score=False)
    GRID.fit(trainDocumentV, trainLabelV)

How can I sustain at least 3-4 parallel jobs?

thanks,
Sumeet

From joel.nothman at gmail.com Sun Jan 7 18:35:46 2018
From: joel.nothman at gmail.com (Joel Nothman)
Date: Mon, 8 Jan 2018 10:35:46 +1100
Subject: [scikit-learn] sklearn.model_selection.GridSearchCV - unable to use n_jobs>1 on MacOS Sierra python 2.7

What do you mean by "the jobs start to die out one by one"? Surely the jobs should finish and die out one by one...?
URL: From ross at cgl.ucsf.edu Sun Jan 7 18:46:55 2018 From: ross at cgl.ucsf.edu (Bill Ross) Date: Sun, 7 Jan 2018 15:46:55 -0800 Subject: [scikit-learn] sklearn.model_selection.GridSearchCV - unable to use n_jobs>1 on MacOS Sierra python 2.7 In-Reply-To: References: Message-ID: <7b9e2c06-598e-afe2-915f-fc08f976167e@cgl.ucsf.edu> What interval between dying, vs. how long is the run overall? Obviously you want that ratio to be 'small enough'. On 1/7/18 3:35 PM, Joel Nothman wrote: > What do you mean by "the jobs start to die out one by one"? Surely the > jobs should finish and die out one by one...? > > On 8 January 2018 at 06:35, Sumeet Sandhu > wrote: > > Hi, > > I was able to run this with n_jobs=-1, and the activity monitor > does show all 8 CPUs engaged, but the jobs start to die out one by > one. I tried with n_jobs=2, same story. > The only option that works is n_jobs=1. > I played around with 'pre_dispatch' a bit - unclear what that does. > > GRID = GridSearchCV(LogisticRegression(), param_grid, > scoring=None, fit_params=None, n_jobs=1, iid=True, refit=True, > cv=10, verbose=0, error_score=0, return_train_score=False) > GRID.fit(trainDocumentV,trainLabelV) > > > How can I sustain at least 3-4 parallel jobs? > > thanks, > Sumeet > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > > > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn -------------- next part -------------- An HTML attachment was scrubbed... URL: From manuel.castejon at gmail.com Mon Jan 8 18:58:07 2018 From: manuel.castejon at gmail.com (=?UTF-8?Q?Manuel_Castej=C3=B3n_Limas?=) Date: Tue, 9 Jan 2018 00:58:07 +0100 Subject: [scikit-learn] Any plans on generalizing Pipeline and transformers? 

Just a quick ping to share that I've kept playing with this PipeGraph toy.
The following example reflects its current state.

* As you can see, scikit-learn models can be used as steps in the nodes of
  the graph just by saying so, for example:

    'Gaussian_Mixture': {'step': GaussianMixture,
                         'kargs': {'n_components': 3},
                         'connections': {'X': ('Concatenate_Xy', 'Xy')},
                         'use_for': ['fit'],
                         },

* Custom steps need succinct declarations with very little code
* Graph description is nice to read, in my humble opinion.
* Optional 'fit' and/or 'run' roles
* TO-DO: Using the memory option to cache, and making it compatible with
  GridSearchCV. I was too busy playing with template methods in order to
  simplify its use.

I have convinced some nice colleagues at my university to team up with me
and write some nice documentation.

Best wishes
Manolo

import pandas as pd
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.mixture import GaussianMixture
from sklearn.linear_model import LinearRegression

# work in progress library: https://github.com/mcasl/PAELLA/
from pipeGraph import (PipeGraph,
                       FirstStep,
                       LastStep,
                       CustomStep)
from paella import Paella

URL = "https://raw.githubusercontent.com/mcasl/PAELLA/master/data/sin_60_percent_noise.csv"
data = pd.read_csv(URL, usecols=['V1', 'V2'])
X, y = data[['V1']], data[['V2']]


class CustomConcatenationStep(CustomStep):
    def _post_fit(self):
        self.output['Xy'] = pd.concat(self.input, axis=1)


class CustomCombinationStep(CustomStep):
    def _post_fit(self):
        self.output['classification'] = np.where(self.input['dominant'] < 0,
                                                 self.input['dominant'],
                                                 self.input['other'])


class CustomPaellaStep(CustomStep):
    def _pre_fit(self):
        self.sklearn_object = Paella(**self.kargs)

    def _fit(self):
        self.sklearn_object.fit(**self.input)

    def _post_fit(self):
        self.output['prediction'] = self.sklearn_object.transform(self.input['X'],
                                                                   self.input['y'])


graph_description = {
    'First': {'step': FirstStep,
              'connections': {'X': X, 'y': y},
              'use_for': ['fit', 'run'],
              },

    'Concatenate_Xy': {'step': CustomConcatenationStep,
                       'connections': {'df1': ('First', 'X'),
                                       'df2': ('First', 'y')},
                       'use_for': ['fit'],
                       },

    'Gaussian_Mixture': {'step': GaussianMixture,
                         'kargs': {'n_components': 3},
                         'connections': {'X': ('Concatenate_Xy', 'Xy')},
                         'use_for': ['fit'],
                         },

    'Dbscan': {'step': DBSCAN,
               'kargs': {'eps': 0.05},
               'connections': {'X': ('Concatenate_Xy', 'Xy')},
               'use_for': ['fit'],
               },

    'Combine_Clustering': {'step': CustomCombinationStep,
                           'connections': {'dominant': ('Dbscan', 'prediction'),
                                           'other': ('Gaussian_Mixture', 'prediction')},
                           'use_for': ['fit'],
                           },

    'Paella': {'step': CustomPaellaStep,
               'kargs': {'noise_label': -1,
                         'max_it': 20,
                         'regular_size': 400,
                         'minimum_size': 100,
                         'width_r': 0.99,
                         'n_neighbors': 5,
                         'power': 30,
                         'random_state': None},
               'connections': {'X': ('First', 'X'),
                               'y': ('First', 'y'),
                               'classification': ('Combine_Clustering', 'classification')},
               'use_for': ['fit'],
               },

    'Regressor': {'step': LinearRegression,
                  'kargs': {},
                  'connections': {'X': ('First', 'X'),
                                  'y': ('First', 'y'),
                                  'sample_weight': ('Paella', 'prediction')},
                  'use_for': ['fit', 'run'],
                  },

    'Last': {'step': LastStep,
             'connections': {'prediction': ('Regressor', 'prediction'),
                             },
             'use_for': ['fit', 'run'],
             },
}

pipegraph = PipeGraph(graph_description)
pipegraph.fit()
#Fitting: First
#Fitting: Concatenate_Xy
#Fitting: Dbscan
#Fitting: Gaussian_Mixture
#Fitting: Combine_Clustering
#Fitting: Paella
#0 , #1 , #2 , #3 , #4 , #5 , #6 , #7 , #8 , #9 ,
#10 , #11 , #12 , #13 , #14 , #15 , #16 , #17 , #18 , #19 ,
#Fitting: Regressor
#Fitting: Last

pipegraph.run()
#Running: First
#Running: Regressor
#Running: Last

2017-12-19 13:44 GMT+01:00 Manuel Castejón Limas:

> Dear all,
>
> Kudos to scikit-learn! Having said that, Pipeline is killing me not being
> able to transform anything other than X.
>
> My current study case would need:
> - Transformers being able to handle both X and y, e.g.
> clustering X and y concatenated
> - Pipeline being able to change other params, e.g. sample_weight
>
> Currently, I'm augmenting X through every step with the extra information
> which seems to work ok for my_pipe.fit_transform(X_train,y_train) but
> breaks on my_pipe.transform(X_test) for the lack of the y parameter. Ok, I
> can inherit and modify a descendant from Pipeline class to allow the y
> parameter which is not ideal but I guess it is an option. The gritty part
> comes when having to adapt every regressor at the end of the ladder in
> order to split the extra information from the raw data in X and not being
> able to generate more than one subproduct from each preprocessing step
>
> My current research involves clustering the data and using that
> classification along with X in order to predict outliers which generates
> sample_weight info and I would love to use that on the final regressor.
> Currently there seems not to be another option than pasting that info on X.
>
> All in all, I'm stuck with this API limitation and I would love to learn
> some tricks from you if you could enlighten me.
>
> Thanks in advance!
>
> Manuel Castejón-Limas

From sumeet.k.sandhu at gmail.com Tue Jan 9 00:22:16 2018
From: sumeet.k.sandhu at gmail.com (Sumeet Sandhu)
Date: Mon, 8 Jan 2018 21:22:16 -0800
Subject: [scikit-learn] sklearn.model_selection.GridSearchCV - unable to use n_jobs>1 on MacOS Sierra python 2.7

There are two cases: n_jobs > 1 works when the data is smaller, i.e. when
the training docs numpy array is 15MB. It does not work when the training
matrix is 100MB. My Mac has 16GB RAM.

In the second case, the jobs die out pretty quickly, in seconds, and the
main python process seems to die out (min CPU usage). There is a popup
message saying 'python processes appear to have died'. This is when I run
python on the bash command line.
When I run in the Python GUI IDLE, a message pops up: 'your program is still
running, sure you want to close window'.

What are these jobs anyway? Are they the various parameter combinations in
param_grid, or lower level jobs out of the compiler etc?
Does each job replicate the training data in RAM?

regards

From sumeet.k.sandhu at gmail.com Wed Jan 10 14:41:15 2018
From: sumeet.k.sandhu at gmail.com (Sumeet Sandhu)
Date: Wed, 10 Jan 2018 11:41:15 -0800
Subject: [scikit-learn] sklearn.model_selection.GridSearchCV - unable to use n_jobs>1 on MacOS Sierra python 2.7

And just now, the first case stopped working too: the 15MB training data
causes python to abruptly die.

From ap.heiner at gmail.com Fri Jan 12 11:40:21 2018
From: ap.heiner at gmail.com (andreas heiner)
Date: Fri, 12 Jan 2018 18:40:21 +0200
Subject: [scikit-learn] MPLclassifier

Hi,

I am trying to apply the MLPClassifier to a subset (100 data points, 2
classes) of the 20newsgroups dataset.
I created (ok, copied) the following pipeline:

model_MLP = Pipeline([('vect', CountVectorizer()),
                      ('tfidf', TfidfTransformer()),
                      ('model_MLP', MLPClassifier(solver='lbfgs',
                                                  alpha=1e-5,
                                                  hidden_layer_sizes=(5, 2),
                                                  random_state=1)
                       )
                      ])

model_MLP.fit(twenty_train.data, twenty_train.target)

predicted_MLP = model_MLP.predict(twenty_test.data)

print(metrics.classification_report(twenty_test.target, predicted_MLP,
                                    target_names=twenty_test.target_names))

The numbers I get are hopeless:

                 precision    recall  f1-score   support
    alt.atheism       0.00      0.00      0.00        34
sci.electronics       0.66      1.00      0.80        66

The only reason I can think of is that the dictionaries of the training
and the test set are not the same (test set: 5204 words, training set: 5402
words). That should not be a problem (if I understand Bayes correctly), but
it certainly gives rubbish (see the numbers).

The same setup with the SVD routine works great; all values are around .95

thanks,

Andreas

From joel.nothman at gmail.com Sat Jan 13 21:56:51 2018
From: joel.nothman at gmail.com (Joel Nothman)
Date: Sun, 14 Jan 2018 13:56:51 +1100
Subject: [scikit-learn] MPLclassifier

I don't think this is an issue directly related to scikit-learn. Your
classifier is learning to always predict the majority class. If you do not
have good training performance, then you either need more data or your
model is inappropriate. You're trying to learn lots of parameters from 100
examples. Use a simpler model. Use stronger regularisation (higher alpha).
Work through some tutorials on machine learning diagnostics and modelling
choices.

From abdul.sw84 at gmail.com Sun Jan 14 18:18:57 2018
From: abdul.sw84 at gmail.com (Abdul Abdul)
Date: Sun, 14 Jan 2018 19:18:57 -0400
Subject: [scikit-learn] How to train an image classifier on directories

Hello,

I'm trying to train an image classifier, but I am a bit confused about how
to label my data. The issue here is that for each class I have
subdirectories, each of which contains two images. So it is not the case
that I have classes and each class simply contains the images that come
under that class (i.e. cats vs. dogs).

I will show here some attempts at grouping the data together, but I have
not yet been able to figure out how to assign the label and pass the pairs
of images along with the label to the image classifier.

So, that's how I simply read the two images:

im1 = cv2.imread('img1.jpg')
im1 = img_to_array(im1)

im2 = cv2.imread('img2.jpg')
im2 = img_to_array(im2)

I then *pair* the images as follows:

pair = (im1, im2)

For labeling, this is what I did:

label = root.split(os.path.sep)[-2]
label = 1 if label == 'cat' else 0

How can I group the above pairs of images (im1, im2) and attach the label
to them? Especially since I want to pass them to the following scikit-learn
function:

(trainX, testX, trainY, testY) = train_test_split(data,
    labels, test_size=0.25, random_state=42)

Thanks.

From joel.nothman at gmail.com Sun Jan 14 18:59:51 2018
From: joel.nothman at gmail.com (Joel Nothman)
Date: Mon, 15 Jan 2018 10:59:51 +1100
Subject: [scikit-learn] How to train an image classifier on directories

Why not just do the train_test_split over directory names, and later (e.g.
in a Pipeline) read in the images?

From amanpratik10 at gmail.com Wed Jan 17 13:58:07 2018
From: amanpratik10 at gmail.com (Aman Pratik)
Date: Thu, 18 Jan 2018 00:28:07 +0530
Subject: [scikit-learn] GSoC 2018

Hello Admins,

I wished to know if scikit-learn would participate in GSoC this year
(2018).

Thanks,
Aman Pratik

From gauravdhingra.gxyd at gmail.com Tue Jan 23 10:42:36 2018
From: gauravdhingra.gxyd at gmail.com (Gaurav Dhingra)
Date: Tue, 23 Jan 2018 21:12:36 +0530
Subject: [scikit-learn] Fwd: Re: Topic for thesis work on scikit learn
Message-ID: <4be189c1-5e9b-8421-b12d-8648833ce921@gmail.com>

-------- Forwarded Message --------
Subject: Re: [scikit-learn] Topic for thesis work on scikit learn
Date: Tue, 23 Jan 2018 10:16:36 -0500
From: Andreas Mueller
To: Gaurav Dhingra

Hi Gaurav.

Is your mentor experienced in contributing to sklearn?
Will they be able to review your code to the scikit-learn standards?
Have you worked on any other pull requests so far?
Getting anything into scikit-learn without close collaboration with the
community is quite tricky.

Having a faster K-means implementation based on recent research in the
area would be interesting. There's also interest in adding Robust PCA,
probabilistic inference trees, and improving the latent Dirichlet
allocation code.

You can find issues on any of these in the issue tracker, which also has
many more feature requests.

Andy

On 12/31/2017 05:46 AM, Gaurav Dhingra wrote:
>
> Hi Andreas,
>
> I think I'll get access to a local mentor from my college, so I think
> I rule that issue out, though for technicalities still I would /like/
> to be more dependent on feedback from the scikit-learn community,
> since my aim wouldn't be to make something for my own use but rather
> something that would be more useful for the scikit-learn community, so
> that it eventually gets merged into master.
>
> I'm currently looking for topic that I can take up, I tried looking
> into scikit-learn wiki but it doesn't mention for what I'm looking for
> (no topic is mentioned). Do you have some topic in mind that could be
> useful for addition to scikit-learn? Even if you could direct me to
> appropriate links I would be happy to look into those.
>
>
> On Wednesday 01 November 2017 01:43 AM, Andreas Mueller wrote:
>> Hi Gaurav.
>>
>> Do you have a local mentor? I think having a mentor that can guide
>> you during a thesis is very important.
>> You could get some feedback from the community for a contribution,
>> but that can be slow,
>> and is entirely on volunteer basis, so there is no guarantee that
>> you'll get the necessary feedback in time
>> to finish your thesis.
>>
>> Mentoring a thesis - in particular without knowing you - is a serious
>> commitment, so I'm not sure someone
>> from inside the project will want to do this. I saw you already made
>> a contribution in
>> https://github.com/scikit-learn/scikit-learn/pull/10005
>> but that's a very different scope than doing what I expect would be
>> several months of work.
>
> Though in this regard I've made a few more contributions, here is the
> link https://github.com/scikit-learn/scikit-learn/pulls/gxyd, though I
> know none of them is a big contribution. If you think I should work on
> a big enough PR, can you please suggest me some issue in that regard?
>
> Thanks.
>
>>
>> Best,
>> Andy
>>
>> On 10/31/2017 03:31 PM, Gaurav Dhingra wrote:
>>> Hi everyone,
>>>
>>> I am a final year (5th year) undergraduate Applied Mathematics
>>> student in India. I am thinking of doing my final year thesis by
>>> doing some work (coding part) on scikit learn, so I was thinking if
>>> anyone could tell me if there are available topics (not necessarily
>>> names of those topics) that I could work on being an undergraduate
>>> student? I would want to expand upon this in December when my exams
>>> will be over. But in the mean time would want to take a step in that
>>> direction by just knowing if there will be available topics that I
>>> could work on.
>>>
>>> It could be the case that available topics are not so easy for an
>>> undergraduate, still in that case I would like to do some research
>>> on the topics first.
>
> --
> Gaurav Dhingra
> (sent from Thunderbird email client)

From gauravdhingra.gxyd at gmail.com Tue Jan 23 11:09:23 2018
From: gauravdhingra.gxyd at gmail.com (Gaurav Dhingra)
Date: Tue, 23 Jan 2018 21:39:23 +0530
Subject: [scikit-learn] Fwd: Re: Topic for thesis work on scikit learn
In-Reply-To: <4be189c1-5e9b-8421-b12d-8648833ce921@gmail.com>
References: <4be189c1-5e9b-8421-b12d-8648833ce921@gmail.com>

Hi Andreas,

On Tuesday 23 January 2018 09:12 PM, Gaurav Dhingra wrote:
> -------- Forwarded Message --------
> Subject: Re: [scikit-learn] Topic for thesis work on scikit learn
> Date: Tue, 23 Jan 2018 10:16:36 -0500
> From: Andreas Mueller
> To: Gaurav Dhingra
>
> Hi Gaurav.
>
> Is your mentor experienced in contributing to sklearn?

No, she isn't.

> Will they be able to review your code to the scikit-learn standards?

No.

> Have you worked on any other pull requests so far?

I've worked on a few. Please have a look at
https://github.com/scikit-learn/scikit-learn/pulls/gxyd; in fact I expect
that 3 of the open PRs will be merged soon.

> Getting anything into scikit-learn without close collaboration with
> the community is quite tricky.
>
> Having a faster K-means implementation based on recent research in the
> area would be interesting,
> There's also interest in adding Robust PCA, probabilistic inference
> trees, and improving the latent dirichlet alloctation code.

I tried to look into what the /scikit-learn community/devs/ consider a
priority to have in their code-base (instead of me looking explicitly for
topics I like). When I looked, I thought of
https://github.com/scikit-learn/scikit-learn/issues/8337 or
https://github.com/scikit-learn/scikit-learn/issues/6557 as possible
topics. But since I'm aware that your unavailability (being busy with
teaching) can be an issue, I simultaneously looked for other options.

I had a conversation with Joel (he was kind enough to PM me); this is what
he said (only the important part of the conversation):

| Tricky things we've been trying to do for years:
|     * estimator tags
|     * sample props
| tools for optimising cluster parameters (e.g. #6948)
| sample props == #4497 and associated
| related to clusterer parameters, #6160
| estimator tags relates to #6715
| #6777 looks tricky from an ML perspective.

I'm thinking of choosing
https://github.com/scikit-learn/scikit-learn/pull/6948 (ENH optimal
n_clusters value), i.e. completing that PR. If you have the availability
to review my PRs (if I do open them), then I'd be glad to work with you on
either /Conditional inference trees/ or /adding post-pruning for decision
trees/. I'm aware that, as Joel put it earlier, /Andreas has escaped into
the teaching world/.

Anyway, I don't expect my guide to give me feedback on scikit-learn
code, though she will definitely have theoretical explanations for my
questions. Also, since we can have a co-guide (apart from the local
guide), I would definitely consider that as an option for someone from
scikit-learn, be it you or maybe Joel. But even Joel is expected to
get back to the academic world as well.

If things don't go a little positive (neither you nor Joel nor maybe
someone else from the scikit-learn community is available), I'm going to
take a little longer, but I'll probably get there eventually.

--
Gaurav Dhingra
(sent from Thunderbird email client)

From t3kcit at gmail.com Wed Jan 24 17:17:20 2018
From: t3kcit at gmail.com (Andreas Mueller)
Date: Wed, 24 Jan 2018 17:17:20 -0500
Subject: [scikit-learn] Fwd: Re: Topic for thesis work on scikit learn
References: <4be189c1-5e9b-8421-b12d-8648833ce921@gmail.com>
Message-ID: <2e19386b-9a57-1548-08aa-00d6af9bc74c@gmail.com>

On 01/23/2018 11:09 AM, Gaurav Dhingra wrote:
> | Tricky things we've been trying to do for years:
> |     * estimator tags
> |     * sample props

I actually have a student working on estimator tags right now.

> I'm thinking of choosing
> https://github.com/scikit-learn/scikit-learn/pull/6948 (ENH optimal
> n_clusters value), i.e. completing that PR.
> If you will be having availability to review my PR's (if I do open
> them), then I'd be glad to work with you on either /Conditional
> inference trees/ or /adding post-pruning for decision trees/.

No, I don't have time to review PRs.

> If things don't go a little positive (neither you or Joel or may be
> someone else from scikit-learn community is available), I'm gonna be
> taking a little longer but I'll eventually get there probably.

No, it's not possible to contribute to scikit-learn without working with
someone from the community.
Each pull request requires two reviewers to be merged. And that is
usually a prolonged process of back and forth.
Without someone stepping up to review, you can't get your code in.

From y.mazari at gmail.com Sun Jan 28 00:59:12 2018
From: y.mazari at gmail.com (Yacine MAZARI)
Date: Sun, 28 Jan 2018 14:59:12 +0900
Subject: [scikit-learn] CountVectorizer: Additional Feature Suggestion

Hello,

I would like to work on adding an additional feature to
"sklearn.feature_extraction.text.CountVectorizer".

In the current implementation, the definition of term frequency is the
number of times a term t occurs in document d.

However, another definition that is very commonly used in practice is the
term frequency adjusted for document length, i.e. tf = raw counts /
document length.

I intend to implement this by adding an additional boolean parameter
"relative_frequency" to the constructor of CountVectorizer.
If the parameter is true, normalize X by document length (along axis=1) in
"CountVectorizer.fit_transform()".

What do you think?
If this sounds reasonable and worth it, I will send a PR.

Thank you,
Yacine.
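
To make the proposal concrete, here is a minimal sketch of the length-adjusted term frequency, tf = raw count / document length, in plain Python. The function name relative_term_frequency and the toy documents are illustrative assumptions, not part of scikit-learn, and whitespace splitting stands in for CountVectorizer's own tokenizer.

```python
from collections import Counter


def relative_term_frequency(document):
    """Return {term: raw count / document length} for one document.

    Hypothetical helper for illustration; tokens are split on whitespace.
    """
    tokens = document.lower().split()
    if not tokens:
        return {}
    counts = Counter(tokens)
    return {term: count / len(tokens) for term, count in counts.items()}


# Two toy documents in which "cat" occurs 5 times each,
# but the documents contain 20 and 200 tokens respectively.
d1 = " ".join(["cat"] * 5 + ["filler"] * 15)
d2 = " ".join(["cat"] * 5 + ["filler"] * 195)

tf1 = relative_term_frequency(d1)["cat"]
tf2 = relative_term_frequency(d2)["cat"]
print(tf1, tf2)
```

The raw count of "cat" is identical in both documents, while the relative frequency differs by a factor of ten, which is the distinction the proposal is after.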

From jakevdp at cs.washington.edu Sun Jan 28 01:11:08 2018
From: jakevdp at cs.washington.edu (Jacob Vanderplas)
Date: Sat, 27 Jan 2018 22:11:08 -0800
Subject: [scikit-learn] CountVectorizer: Additional Feature Suggestion

Hi Yacine,
If I'm understanding you correctly, I think what you have in mind is
already implemented in scikit-learn in the TF-IDF vectorizer.

Best,
Jake

Jake VanderPlas
Senior Data Science Fellow
Director of Open Software
University of Washington eScience Institute

From y.mazari at gmail.com Sun Jan 28 02:31:16 2018
From: y.mazari at gmail.com (Yacine MAZARI)
Date: Sun, 28 Jan 2018 07:31:16 +0000
Subject: [scikit-learn] CountVectorizer: Additional Feature Suggestion

Hi Jake,

Thanks for the quick reply.

What I meant is different from the TfidfVectorizer.
Let me clarify:

In the TfidfVectorizer, the raw counts are multiplied by IDF, which
basically means normalizing the counts by document frequencies, tf * idf.
But still, tf is defined here as the raw count of a term in the document.

What I am suggesting is to add the possibility to use another definition
of tf: tf = relative frequency of a term in a document = raw counts /
document length.
On top of this, one could further normalize by IDF to get the TF-IDF
(https://en.wikipedia.org/wiki/Tf%E2%80%93idf#Term_frequency_2).

When can this be useful? Here is an example:
Say term t occurs 5 times in document d1, and also 5 times in document d2.
At first glance, it seems that the term conveys the same information about
both documents. But if we also check document lengths, and find that the
length of d1 is 20 whereas the length of d2 is 200, then the "importance"
and information carried by the same term in the two documents is probably
not the same.
If we use relative frequency instead of absolute counts, then
tf1 = 5/20 = 0.25 whereas tf2 = 5/200 = 0.025.

There are many practical cases (document similarity, document
classification, etc.) where using relative frequencies yields better
results, and it might be worth making the CountVectorizer support this.

Regards,
Yacine.

From joel.nothman at gmail.com Sun Jan 28 04:29:58 2018
From: joel.nothman at gmail.com (Joel Nothman)
Date: Sun, 28 Jan 2018 20:29:58 +1100
Subject: [scikit-learn] CountVectorizer: Additional Feature Suggestion

sklearn.preprocessing.Normalizer allows you to normalize any vector by its
L1 or L2 norm. L1 would be equivalent to "document length" as long as you
did not intend to count stop words in the length.
sklearn.feature_extraction.text.TfidfTransformer offers similar norming,
but does so only after accounting for IDF or TF transformation. Since the
length normalisation transformation is stateless, it can also be computed
with a sklearn.preprocessing.FunctionTransformer.

I can't say it's especially obvious that these features are available, and
improvements to the documentation are welcome, but CountVectorizer is
complicated enough and we would rather avoid more parameters if we can.
I wouldn't hate if length normalisation was added to TfidfTransformer, if it was shown that normalising before IDF multiplication was more effective than (or complementary to) norming afterwards. On 28 January 2018 at 18:31, Yacine MAZARI wrote: > Hi Jake, > > Thanks for the quick reply. > > What I meant is different from the TfIdfVectorizer. Let me clarify: > > In the TfIdfVectorizer, the raw counts are multiplied by IDF, which > badically means normalizing the counts by document frequencies, tf * idf. > But still, tf is deined here as the raw count of a term in the dicument. > > What I am suggesting, is to add the possibility to use another definition > of tf, tf= relative frequency of a term in a document = raw counts / > document length. > On top of this, one could further normalize by IDF to get the TF-IDF ( > https://en.wikipedia.org/wiki/Tf%E2%80%93idf#Term_frequency_2). > > When can this be useful? Here is an example: > Say term t occurs 5 times in document d1, and also 5 times in document d2. > At first glance, it seems that the term conveys the same information about > both documents. But if we also check document lengths, and find that length > of d1 is 20, wheras lenght of d2 is 200, then probably the ?importance? and > information carried by the same term in the two documents is not the same. > If we use relative frequency instead of absolute counts, then tf1=5/20=0.4 > whereas tf2=5/200=0.04. > > There are many practical cases (document similarity, document > classification, etc...) where using relative frequencies yields better > results, and it might be worth making the CountVectorizer support this. > > Regards, > Yacine. > > On Sun, Jan 28, 2018 at 15:12 Jacob Vanderplas > wrote: > >> Hi Yacine, >> If I'm understanding you correctly, I think what you have in mind is >> already implemented in scikit-learn in the TF-IDF vectorizer >> >> . 
>> >> Best, >> Jake >> >> Jake VanderPlas >> Senior Data Science Fellow >> Director of Open Software >> University of Washington eScience Institute >> >> On Sat, Jan 27, 2018 at 9:59 PM, Yacine MAZARI >> wrote: >> >>> Hello, >>> >>> I would like to work on adding an additional feature to >>> "sklearn.feature_extraction.text.CountVectorizer". >>> >>> In the current implementation, the definition of term frequency is the >>> number of times a term t occurs in document d. >>> >>> However, another definition that is very commonly used in practice is >>> the term frequency adjusted for document length >>> , i.e: >>> tf = raw counts / document length. >>> >>> I intend to implement this by adding an additional boolean parameter >>> "relative_frequency" to the constructor of CountVectorizer. >>> If the parameter is true, normalize X by document length (along x=1) in >>> "CountVectorizer.fit_transform()". >>> >>> What do you think? >>> If this sounds reasonable an worth it, I will send a PR. >>> >>> Thank you, >>> Yacine. >>> >>> _______________________________________________ >>> scikit-learn mailing list >>> scikit-learn at python.org >>> https://mail.python.org/mailman/listinfo/scikit-learn >>> >>> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From se.raschka at gmail.com Sun Jan 28 04:31:38 2018 From: se.raschka at gmail.com (Sebastian Raschka) Date: Sun, 28 Jan 2018 01:31:38 -0800 Subject: [scikit-learn] CountVectorizer: Additional Feature Suggestion In-Reply-To: References: Message-ID: Hi, Yacine, Just on a side note, you can set use_idf=False in the TfidfVectorizer and only normalize the vectors by their L2 norm.
But yeah, the normalization you suggest might be really handy in certain cases. I am not sure though if it's worth making this another parameter in the CountVectorizer (which already has quite a lot of parameters), as it can be computed quite easily if I am not misinterpreting something. Since the length of each document is determined by the sum of the words in each vector, one could simply normalize it by the document length as follows: > from sklearn.feature_extraction.text import CountVectorizer > dataset = ['The sun is shining and the weather is sweet', > 'Hello World. The sun is shining and the weather is sweet'] > > vect = CountVectorizer() > vect.fit(dataset) > transf = vect.transform(dataset) > normalized_word_vectors = transf / transf.sum(axis=1) Where it would be tricky though is when you remove stop words during preprocessing but want to include them in the normalization. Then, you might have to do sth like this: > from sklearn.feature_extraction.text import CountVectorizer > import numpy as np > > dataset = ['The sun is shining and the weather is sweet', > 'Hello World. The sun is shining and the weather is sweet'] > > counts = np.array([len(s.split()) for s in dataset]).reshape(-1, 1) > vect = CountVectorizer(stop_words='english') > vect.fit(dataset) > transf = vect.transform(dataset) > transf / counts Best, Sebastian > On Jan 27, 2018, at 11:31 PM, Yacine MAZARI wrote: > > Hi Jake, > > Thanks for the quick reply. > > What I meant is different from the TfIdfVectorizer. Let me clarify: > > In the TfIdfVectorizer, the raw counts are multiplied by IDF, which badically means normalizing the counts by document frequencies, tf * idf. > But still, tf is deined here as the raw count of a term in the dicument. > > What I am suggesting, is to add the possibility to use another definition of tf, tf= relative frequency of a term in a document = raw counts / document length. 
> On top of this, one could further normalize by IDF to get the TF-IDF (https://en.wikipedia.org/wiki/Tf%E2%80%93idf#Term_frequency_2). > > When can this be useful? Here is an example: > Say term t occurs 5 times in document d1, and also 5 times in document d2. > At first glance, it seems that the term conveys the same information about both documents. But if we also check document lengths, and find that length of d1 is 20, wheras lenght of d2 is 200, then probably the ?importance? and information carried by the same term in the two documents is not the same. > If we use relative frequency instead of absolute counts, then tf1=5/20=0.4 whereas tf2=5/200=0.04. > > There are many practical cases (document similarity, document classification, etc...) where using relative frequencies yields better results, and it might be worth making the CountVectorizer support this. > > Regards, > Yacine. > > On Sun, Jan 28, 2018 at 15:12 Jacob Vanderplas wrote: > Hi Yacine, > If I'm understanding you correctly, I think what you have in mind is already implemented in scikit-learn in the TF-IDF vectorizer. > > Best, > Jake > > Jake VanderPlas > Senior Data Science Fellow > Director of Open Software > University of Washington eScience Institute > > On Sat, Jan 27, 2018 at 9:59 PM, Yacine MAZARI wrote: > Hello, > > I would like to work on adding an additional feature to "sklearn.feature_extraction.text.CountVectorizer". > > In the current implementation, the definition of term frequency is the number of times a term t occurs in document d. > > However, another definition that is very commonly used in practice is the term frequency adjusted for document length, i.e: tf = raw counts / document length. > > I intend to implement this by adding an additional boolean parameter "relative_frequency" to the constructor of CountVectorizer. > If the parameter is true, normalize X by document length (along x=1) in "CountVectorizer.fit_transform()". > > What do you think? 
> If this sounds reasonable an worth it, I will send a PR. > > Thank you, > Yacine. > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn From se.raschka at gmail.com Sun Jan 28 04:36:47 2018 From: se.raschka at gmail.com (Sebastian Raschka) Date: Sun, 28 Jan 2018 01:36:47 -0800 Subject: [scikit-learn] CountVectorizer: Additional Feature Suggestion In-Reply-To: References: Message-ID: <6D505084-C51E-4EC3-B818-2715E016116C@gmail.com> Good point Joel, and I actually forgot that you can set the norm param in the TfidfVectorizer, so one could basically do vect = TfidfVectorizer(use_idf=False, norm='l1') to have the CountVectorizer behavior but normalizing by the document length. Best, Sebastian > On Jan 28, 2018, at 1:29 AM, Joel Nothman wrote: > > sklearn.preprocessing.Normalizer allows you to normalize any vector by its L1 or L2 norm. L1 would be equivalent to "document length" as long as you did not intend to count stop words in the length. sklearn.feature_extraction.text.TfidfTransformer offers similar norming, but does so only after accounting for IDF or TF transformation. Since the length normalisation transformation is stateless, it can also be computed with a sklearn.preprocessing.FunctionTransformer. > > I can't say it's especially obvious that these features available, and improvements to the documentation are welcome, but CountVectorizer is complicated enough and we would rather avoid more parameters if we can. 
I wouldn't hate if length normalisation was added to TfidfTransformer, if it was shown that normalising before IDF multiplication was more effective than (or complementary to) norming afterwards. > > On 28 January 2018 at 18:31, Yacine MAZARI wrote: > Hi Jake, > > Thanks for the quick reply. > > What I meant is different from the TfIdfVectorizer. Let me clarify: > > In the TfIdfVectorizer, the raw counts are multiplied by IDF, which badically means normalizing the counts by document frequencies, tf * idf. > But still, tf is deined here as the raw count of a term in the dicument. > > What I am suggesting, is to add the possibility to use another definition of tf, tf= relative frequency of a term in a document = raw counts / document length. > On top of this, one could further normalize by IDF to get the TF-IDF (https://en.wikipedia.org/wiki/Tf%E2%80%93idf#Term_frequency_2). > > When can this be useful? Here is an example: > Say term t occurs 5 times in document d1, and also 5 times in document d2. > At first glance, it seems that the term conveys the same information about both documents. But if we also check document lengths, and find that length of d1 is 20, wheras lenght of d2 is 200, then probably the ?importance? and information carried by the same term in the two documents is not the same. > If we use relative frequency instead of absolute counts, then tf1=5/20=0.4 whereas tf2=5/200=0.04. > > There are many practical cases (document similarity, document classification, etc...) where using relative frequencies yields better results, and it might be worth making the CountVectorizer support this. > > Regards, > Yacine. > > On Sun, Jan 28, 2018 at 15:12 Jacob Vanderplas wrote: > Hi Yacine, > If I'm understanding you correctly, I think what you have in mind is already implemented in scikit-learn in the TF-IDF vectorizer. 
> > Best, > Jake > > Jake VanderPlas > Senior Data Science Fellow > Director of Open Software > University of Washington eScience Institute > > On Sat, Jan 27, 2018 at 9:59 PM, Yacine MAZARI wrote: > Hello, > > I would like to work on adding an additional feature to "sklearn.feature_extraction.text.CountVectorizer". > > In the current implementation, the definition of term frequency is the number of times a term t occurs in document d. > > However, another definition that is very commonly used in practice is the term frequency adjusted for document length, i.e: tf = raw counts / document length. > > I intend to implement this by adding an additional boolean parameter "relative_frequency" to the constructor of CountVectorizer. > If the parameter is true, normalize X by document length (along x=1) in "CountVectorizer.fit_transform()". > > What do you think? > If this sounds reasonable an worth it, I will send a PR. > > Thank you, > Yacine. > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn From joel.nothman at gmail.com Sun Jan 28 04:56:28 2018 From: joel.nothman at gmail.com (Joel Nothman) Date: Sun, 28 Jan 2018 20:56:28 +1100 Subject: [scikit-learn] CountVectorizer: Additional Feature Suggestion In-Reply-To: <6D505084-C51E-4EC3-B818-2715E016116C@gmail.com> References: <6D505084-C51E-4EC3-B818-2715E016116C@gmail.com> Message-ID: That's equivalent to 
Normalizer(norm='l1') or FunctionTransformer(np.linalg.norm, kw_args={'axis': 1, 'ord': 1}). The problem is that length norm followed by TfidfTransformer now can't do sublinear TF right... But that's alright if we know we can always do FunctionTransformer(lambda X: calc_sublinear(X) / X.sum(axis=1)), perhaps then followed by applying IDF from TfidfTransformer. Yes, it's not straightforward, but it's very hard to provide a library that suits everyone's needs... so FunctionTransformer and Pipeline are your friends :) On 28 January 2018 at 20:36, Sebastian Raschka wrote: > Good point Joel, and I actually forgot that you can set the norm param in > the TfidfVectorizer, so one could basically do > > vect = TfidfVectorizer(use_idf=False, norm='l1') > > to have the CountVectorizer behavior but normalizing by the document > length. > > Best, > Sebastian > > > On Jan 28, 2018, at 1:29 AM, Joel Nothman > wrote: > > > > sklearn.preprocessing.Normalizer allows you to normalize any vector by > its L1 or L2 norm. L1 would be equivalent to "document length" as long as > you did not intend to count stop words in the length. > sklearn.feature_extraction.text.TfidfTransformer offers similar norming, > but does so only after accounting for IDF or TF transformation. Since the > length normalisation transformation is stateless, it can also be computed > with a sklearn.preprocessing.FunctionTransformer. > > > > I can't say it's especially obvious that these features available, and > improvements to the documentation are welcome, but CountVectorizer is > complicated enough and we would rather avoid more parameters if we can. I > wouldn't hate if length normalisation was added to TfidfTransformer, if it > was shown that normalising before IDF multiplication was more effective > than (or complementary to) norming afterwards. > > > > On 28 January 2018 at 18:31, Yacine MAZARI wrote: > > Hi Jake, > > > > Thanks for the quick reply. > > > > What I meant is different from the TfIdfVectorizer. 
Let me clarify: > > > > In the TfIdfVectorizer, the raw counts are multiplied by IDF, which > badically means normalizing the counts by document frequencies, tf * idf. > > But still, tf is deined here as the raw count of a term in the dicument. > > > > What I am suggesting, is to add the possibility to use another > definition of tf, tf= relative frequency of a term in a document = raw > counts / document length. > > On top of this, one could further normalize by IDF to get the TF-IDF ( > https://en.wikipedia.org/wiki/Tf%E2%80%93idf#Term_frequency_2). > > > > When can this be useful? Here is an example: > > Say term t occurs 5 times in document d1, and also 5 times in document > d2. > > At first glance, it seems that the term conveys the same information > about both documents. But if we also check document lengths, and find that > length of d1 is 20, wheras lenght of d2 is 200, then probably the > ?importance? and information carried by the same term in the two documents > is not the same. > > If we use relative frequency instead of absolute counts, then > tf1=5/20=0.4 whereas tf2=5/200=0.04. > > > > There are many practical cases (document similarity, document > classification, etc...) where using relative frequencies yields better > results, and it might be worth making the CountVectorizer support this. > > > > Regards, > > Yacine. > > > > On Sun, Jan 28, 2018 at 15:12 Jacob Vanderplas < > jakevdp at cs.washington.edu> wrote: > > Hi Yacine, > > If I'm understanding you correctly, I think what you have in mind is > already implemented in scikit-learn in the TF-IDF vectorizer. > > > > Best, > > Jake > > > > Jake VanderPlas > > Senior Data Science Fellow > > Director of Open Software > > University of Washington eScience Institute > > > > On Sat, Jan 27, 2018 at 9:59 PM, Yacine MAZARI > wrote: > > Hello, > > > > I would like to work on adding an additional feature to > "sklearn.feature_extraction.text.CountVectorizer". 
> > > > In the current implementation, the definition of term frequency is the > number of times a term t occurs in document d. > > > > However, another definition that is very commonly used in practice is > the term frequency adjusted for document length, i.e: tf = raw counts / > document length. > > > > I intend to implement this by adding an additional boolean parameter > "relative_frequency" to the constructor of CountVectorizer. > > If the parameter is true, normalize X by document length (along x=1) in > "CountVectorizer.fit_transform()". > > > > What do you think? > > If this sounds reasonable an worth it, I will send a PR. > > > > Thank you, > > Yacine. > > > > _______________________________________________ > > scikit-learn mailing list > > scikit-learn at python.org > > https://mail.python.org/mailman/listinfo/scikit-learn > > > > > > _______________________________________________ > > scikit-learn mailing list > > scikit-learn at python.org > > https://mail.python.org/mailman/listinfo/scikit-learn > > > > _______________________________________________ > > scikit-learn mailing list > > scikit-learn at python.org > > https://mail.python.org/mailman/listinfo/scikit-learn > > > > > > _______________________________________________ > > scikit-learn mailing list > > scikit-learn at python.org > > https://mail.python.org/mailman/listinfo/scikit-learn > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From gael.varoquaux at normalesup.org Sun Jan 28 04:34:26 2018 From: gael.varoquaux at normalesup.org (Gael Varoquaux) Date: Sun, 28 Jan 2018 10:34:26 +0100 Subject: [scikit-learn] CountVectorizer: Additional Feature Suggestion In-Reply-To: References: Message-ID: <20180128093426.GC2329564@phare.normalesup.org> On Sun, Jan 28, 2018 at 08:29:58PM +1100, Joel Nothman wrote: > I can't say it's especially obvious that these features available, and > improvements to the documentation are welcome, but CountVectorizer is > complicated enough and we would rather avoid more parameters if we can. Same feeling here. I am afraid of the crowding effect that makes it harder and harder to find things as we add them. Gaël From y.mazari at gmail.com Mon Jan 29 10:39:35 2018 From: y.mazari at gmail.com (Yacine MAZARI) Date: Tue, 30 Jan 2018 00:39:35 +0900 Subject: [scikit-learn] CountVectorizer: Additional Feature Suggestion In-Reply-To: <20180128093426.GC2329564@phare.normalesup.org> References: <20180128093426.GC2329564@phare.normalesup.org> Message-ID: Hi Folks, Thank you all for the feedback and interesting discussion. I do realize that adding a feature comes with risks, and that there should really be compelling reasons to do so. Let me try to address your comments here, and make one final case for the value of this feature: 1) Use Normalizer, FunctionTransformer (or write custom code) to perform normalization of the CountVectorizer result: That would require an additional pass on the data. True, that's "only" O(N), but if there is a way to speed up training an ML model, that'd be an advantage. 2) TfidfVectorizer(use_idf=False, norm='l1'): Yes, that would have the same effect; but note that this is not TF-IDF any more, in that TF-IDF is a two-fold normalization. If one needs TF-IDF (with normalized document counts), then 2 additional passes on the data (with TfidfVectorizer(use_idf=True)) would be required to get IDF normalization, bringing us to a case similar to the above.
3) >> I wouldn't hate if length normalisation was added to TfidfTransformer, if it was shown that normalising before IDF multiplication was more effective than (or complementary to) norming afterwards. I think this is one of the most important points here. Though not a formal proof, I can for example refer to: - NLTK, which is using document-length-normalized term frequencies. - Manning and Schütze's Introduction to Information Retrieval: "The same considerations that led us to prefer weighted representations, in particular length-normalized tf-idf representations, in Chapters 6 and 7 also apply here." On the other hand, applying this kind of normalization to a corpus where the document lengths are similar (such as tweets) will probably not be of any advantage. 4) This will be a handy feature as Sebastian mentioned, and the code change will be very small (careful here...any code change brings risks). What do you think? Best regards, Yacine. -------------- next part -------------- An HTML attachment was scrubbed... URL: From joel.nothman at gmail.com Mon Jan 29 15:27:42 2018 From: joel.nothman at gmail.com (Joel Nothman) Date: Tue, 30 Jan 2018 07:27:42 +1100 Subject: [scikit-learn] CountVectorizer: Additional Feature Suggestion In-Reply-To: References: <20180128093426.GC2329564@phare.normalesup.org> Message-ID: I don't think you will do this without an O(N) cost. The fact that it's done with a second pass is moot. My position stands: if this change happens, it should be to TfidfTransformer (which should perhaps be called something like CountVectorWeighter!) alone. On 30 January 2018 at 02:39, Yacine MAZARI wrote: > Hi Folks, > > Thank you all for the feedback and interesting discussion. > > I do realize that adding a feature comes with risks, and that there should > really be compelling reasons to do so.
> > Let me try to address your comments here, and make one final case for the > value of this feature: > > 1) Use Normalizer, FunctionTransformer (or write a custom code) to perform > normalization of CountVectorizer result: That would require an additional > pass on the data. True that's "only" O(N), but if there is a way to speed > up training an ML model, that'd be an advantage. > > 2) TfidfVectorizer(use_idf=False, norm='l1'): Yes, that would have the > same effect; but not that this not TF-IDF any more, in that TF-IDF is a > two-fold normalization. If one needs TF-IDF (with normalized document > counts), then 2 additional passes on the data (with TfidfVectorizer(use_idf=True)) > would be required to get IDF normalization, bringing us to a case similar > to the above. > > 3) > >> I wouldn't hate if length normalisation was added to TfidfTransformer, > if it was shown that normalising before IDF multiplication was more > effective than (or complementary >> to) norming afterwards. > I think this is one of the most important points here. > Though not a formal proof, I can for example refer to: > > - NLTK , > which is using document-length-normalized term frequencies. > > > - Manning and Sch?tze's Introduction to Information Retrieval > : > "The same considerations that led us to prefer weighted representations, in > particular length-normalized tf-idf representations, in Chapters 6 > > 7 > > also apply here." > > On the other hand, applying this kind of normalization to a corpus where > the document lengths are similar (such as tweets) will probably not be of > any advantage. > > 4) This will be a handy feature as Sebastian mentioned, and the code > change will be very small (careful here...any code change brings risks). > > What do you think? > > Best regards, > Yacine. 
> > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From amir.p.shanehsazzadeh at umasd.net Tue Jan 30 07:24:50 2018 From: amir.p.shanehsazzadeh at umasd.net (AMIR SHANEHSAZZADEH) Date: Tue, 30 Jan 2018 07:24:50 -0500 Subject: [scikit-learn] DBSCAN Border Points Message-ID: Hello, I am working with the latest implementation of DBSCAN. I believe that scikit-learn's implementation does not include non-core points in clusters. This results in border points not being included in clusters. Is there any way to remedy this issue so that border points are included in their respective clusters? Do you know what modifications I would need to make the code? Thank you, Amir Shanehsazzadeh -------------- next part -------------- An HTML attachment was scrubbed... URL: From y.mazari at gmail.com Tue Jan 30 10:19:32 2018 From: y.mazari at gmail.com (Yacine MAZARI) Date: Wed, 31 Jan 2018 00:19:32 +0900 Subject: [scikit-learn] CountVectorizer: Additional Feature Suggestion In-Reply-To: References: <20180128093426.GC2329564@phare.normalesup.org> Message-ID: Okay, thanks for the replies. @Joel: Should I go ahead and send a PR with the change to TfidfTransformer? On Tue, Jan 30, 2018 at 5:27 AM, Joel Nothman wrote: > I don't think you will do this without an O(N) cost. The fact that it's > done with a second pass is moot. > > My position stands: if this change happens, it should be to > TfidfTransformer (which should perhaps be called something like > CountVectorWeighter!) alone. > > On 30 January 2018 at 02:39, Yacine MAZARI wrote: > >> Hi Folks, >> >> Thank you all for the feedback and interesting discussion. >> >> I do realize that adding a feature comes with risks, and that there >> should really be compelling reasons to do so. 
>> >> Let me try to address your comments here, and make one final case for the >> value of this feature: >> >> 1) Use Normalizer, FunctionTransformer (or write a custom code) to >> perform normalization of CountVectorizer result: That would require an >> additional pass on the data. True that's "only" O(N), but if there is a way >> to speed up training an ML model, that'd be an advantage. >> >> 2) TfidfVectorizer(use_idf=False, norm='l1'): Yes, that would have the >> same effect; but not that this not TF-IDF any more, in that TF-IDF is a >> two-fold normalization. If one needs TF-IDF (with normalized document >> counts), then 2 additional passes on the data (with TfidfVectorizer(use_idf=True)) >> would be required to get IDF normalization, bringing us to a case similar >> to the above. >> >> 3) >> >> I wouldn't hate if length normalisation was added to TfidfTransformer, >> if it was shown that normalising before IDF multiplication was more >> effective than (or complementary >> to) norming afterwards. >> I think this is one of the most important points here. >> Though not a formal proof, I can for example refer to: >> >> - NLTK , >> which is using document-length-normalized term frequencies. >> >> >> - Manning and Sch?tze's Introduction to Information Retrieval >> : >> "The same considerations that led us to prefer weighted representations, in >> particular length-normalized tf-idf representations, in Chapters 6 >> >> 7 >> >> also apply here." >> >> On the other hand, applying this kind of normalization to a corpus where >> the document lengths are similar (such as tweets) will probably not be of >> any advantage. >> >> 4) This will be a handy feature as Sebastian mentioned, and the code >> change will be very small (careful here...any code change brings risks). >> >> What do you think? >> >> Best regards, >> Yacine. 
>> >> >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From rth.yurchak at gmail.com Tue Jan 30 14:33:42 2018 From: rth.yurchak at gmail.com (Roman Yurchak) Date: Tue, 30 Jan 2018 20:33:42 +0100 Subject: [scikit-learn] CountVectorizer: Additional Feature Suggestion In-Reply-To: References: <20180128093426.GC2329564@phare.normalesup.org> Message-ID: <6efb3822-c85f-a273-b5c5-5fdb9bce2337@gmail.com> Hi Yacine, On 29/01/18 16:39, Yacine MAZARI wrote: > >> I wouldn't hate if length normalisation was added to > if it was shown that normalising before IDF > multiplication was more effective than (or complementary >> to) norming > afterwards. > I think this is one of the most important points here. > Though not a formal proof, I can for example refer to: > > * NLTK > , > which is using document-length-normalized term frequencies. > > * Manning and Sch?tze's Introduction to Information Retrieval > : > "The same considerations that led us to prefer weighted > representations, in particular length-normalized tf-idf > representations, in Chapters 6 7 also apply here." I believe the conclusion of the Manning's Chapter 6 is the following table with TF-IDF weighting schemes https://nlp.stanford.edu/IR-book/html/htmledition/document-and-query-weighting-schemes-1.html in which the document length normalization is applied _after_ the IDF. So "length-normalized tf-idf" is just TfidfVectorizer with norm='l1' as previously mentioned (at least, if you measure the document length as the number of words it contains). 
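As a rough illustration of the point above (a sketch on a made-up two-document corpus, not code taken from this thread), the following compares relative term frequencies computed by hand with what TfidfVectorizer returns for norm='l1', with and without IDF. The variable names are only for illustration, and "document length" here means the number of tokens the default tokenizer keeps.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Toy corpus, chosen only for illustration.
docs = [
    "the sun is shining and the weather is sweet",
    "hello world the sun is shining and the weather is sweet",
]

# Relative term frequency by hand: raw counts divided by the number
# of counted tokens in each document.
counts = CountVectorizer().fit_transform(docs).toarray()
relative_tf = counts / counts.sum(axis=1, keepdims=True)

# Same quantity from TfidfVectorizer: IDF disabled, L1 row normalization.
tf_l1 = TfidfVectorizer(use_idf=False, norm="l1").fit_transform(docs).toarray()

# With IDF enabled, the L1 normalization is applied after the IDF weighting:
# each row still sums to 1, but rows containing rarer terms no longer
# match the plain relative frequencies.
tfidf_l1 = TfidfVectorizer(use_idf=True, norm="l1").fit_transform(docs).toarray()

print(np.allclose(relative_tf, tf_l1))
print(tfidf_l1.sum(axis=1))
```

In this toy corpus only the second document contains terms that are absent from the other one ("hello", "world"), so only its row changes once IDF is switched on.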
More generally, a weighting & normalization transformer for some of the other configurations in that table is implemented in http://freediscovery.io/doc/stable/python/generated/freediscovery.feature_weighting.SmartTfidfTransformer.html With respect to the NLTK implementation, see https://github.com/nltk/nltk/pull/979#issuecomment-102296527 So I don't think there is a need to change anything in TfidfTransformer... -- Roman From joel.nothman at gmail.com Tue Jan 30 18:17:01 2018 From: joel.nothman at gmail.com (Joel Nothman) Date: Wed, 31 Jan 2018 10:17:01 +1100 Subject: [scikit-learn] DBSCAN Border Points In-Reply-To: References: Message-ID: It includes non-core points, but not points that lie farther than eps from every core point. You can modify eps and min_samples. But perhaps you should just choose a different clustering algorithm if this is behaviour you absolutely do not want. On 30 January 2018 at 23:24, AMIR SHANEHSAZZADEH <amir.p.shanehsazzadeh at umasd.net> wrote: > Hello, > > I am working with the latest implementation of DBSCAN. I believe that > scikit-learn's implementation does not include non-core points in clusters. > This results in border points not being included in clusters. Is there any > way to remedy this issue so that border points are included in their > respective clusters? Do you know what modifications I would need to make to > the code? > > Thank you, > Amir Shanehsazzadeh > 
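The distinction Joel describes in the DBSCAN message above (core points, non-core points within eps of a core point, and everything else) can be sketched in a few lines of pure Python. This is not scikit-learn's implementation: the helper name `classify_points`, the one-dimensional data, and the convention that a point counts itself among its neighbours are assumptions made for this illustration.

```python
def classify_points(points, eps, min_samples):
    """Label each 1-D point as 'core', 'border' or 'noise'.

    Hypothetical helper for illustration only. A point counts itself
    among its own neighbours when compared against min_samples.
    """
    # Indices of all points within eps of each point (itself included).
    neighbours = [
        [j for j, q in enumerate(points) if abs(p - q) <= eps]
        for p in points
    ]
    is_core = [len(nb) >= min_samples for nb in neighbours]

    kinds = []
    for i, nb in enumerate(neighbours):
        if is_core[i]:
            kinds.append("core")
        elif any(is_core[j] for j in nb):
            # Not dense enough itself, but within eps of a core point.
            kinds.append("border")
        else:
            # Farther than eps from every core point.
            kinds.append("noise")
    return kinds


print(classify_points([0, 1, 2, 4, 10], eps=2, min_samples=3))
# → ['core', 'core', 'core', 'border', 'noise']
```

Here the point at 4 has too few neighbours to be a core point, yet it sits within eps of the core point at 2, so it is the kind of border point that still joins a cluster. The point at 10 is the kind that is left out.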
From joel.nothman at gmail.com Tue Jan 30 18:20:54 2018 From: joel.nothman at gmail.com (Joel Nothman) Date: Wed, 31 Jan 2018 10:20:54 +1100 Subject: [scikit-learn] CountVectorizer: Additional Feature Suggestion In-Reply-To: <6efb3822-c85f-a273-b5c5-5fdb9bce2337@gmail.com> References: <20180128093426.GC2329564@phare.normalesup.org> <6efb3822-c85f-a273-b5c5-5fdb9bce2337@gmail.com> Message-ID: A very good point! (Although augmented and log-average tf both do some kind of normalisation of the tf distribution before IDF weighting.)
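For reference, the two term-frequency variants Joel mentions can be sketched as follows. The formulas are the "augmented" and "log ave" entries as I understand them from the weighting-scheme table Roman linked, so treat them as an assumption to check against that table. The function names are hypothetical and no scikit-learn call is involved.

```python
import math
from collections import Counter


def augmented_tf(counts):
    """Augmented tf: 0.5 + 0.5 * tf / (largest tf in the document)."""
    max_tf = max(counts.values())
    return {t: 0.5 + 0.5 * c / max_tf for t, c in counts.items()}


def log_average_tf(counts):
    """Log-average tf: (1 + ln tf) / (1 + ln(mean tf in the document))."""
    ave = sum(counts.values()) / len(counts)
    return {
        t: (1 + math.log(c)) / (1 + math.log(ave))
        for t, c in counts.items()
    }


# Term counts of one toy document (illustrative data only).
counts = Counter("apple banana apple cherry apple".split())
print(augmented_tf(counts))
print(log_average_tf(counts))
```

Both variants rescale each count relative to a statistic of the same document (its maximum or its mean count), which is the normalisation of the tf distribution that happens before any IDF weighting is applied.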