[scikit-learn] Added BM25Transformer and BM25Vectorizer to sklearn.feature_extraction.text

Basil Beirouti basilbeirouti at gmail.com
Sun Jul 10 17:44:30 EDT 2016


Hi all,

I have submitted a pull request to the main branch. I added BM25Transformer
and BM25Vectorizer, which are very similar to TfidfTransformer and
TfidfVectorizer, except that they implement the BM25 algorithm instead. I
would really appreciate feedback on the quality of my work and how I can
improve it.
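
For readers who have not used BM25, here is a minimal pure-Python sketch of
the weighting. It is not the code from the pull request: the function name
bm25_weights, the defaults k1=1.2 and b=0.75, and the non-negative IDF
variant are assumptions chosen for illustration.

```python
import math
from collections import Counter


def bm25_weights(docs, k1=1.2, b=0.75):
    """Return one {term: weight} dict per tokenized document in `docs`.

    Hypothetical helper for illustration; not part of scikit-learn.
    """
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents that contain each term.
    df = Counter(term for d in docs for term in set(d))
    weighted = []
    for d in docs:
        tf = Counter(d)
        # Length normalisation: longer-than-average documents are penalised.
        norm = k1 * (1.0 - b + b * len(d) / avg_len)
        row = {}
        for term, f in tf.items():
            # Non-negative IDF variant (an assumption; other variants exist).
            idf = math.log(1.0 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term frequency saturates: the ratio is bounded by k1 + 1.
            row[term] = idf * f * (k1 + 1.0) / (f + norm)
        weighted.append(row)
    return weighted


docs = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "dogs and cats".split(),
]
weights = bm25_weights(docs)
```

In this sketch a term that occurs in fewer documents gets a larger weight,
and repeated occurrences of a term saturate instead of growing linearly the
way a raw term count does.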

Sincerely,
Basil Beirouti

On Sat, Jul 9, 2016 at 11:00 AM, <scikit-learn-request at python.org> wrote:

> Today's Topics:
>
>    1. Re: Create a "Feature_Weight" Parameter at
>       RandomForestRegressor (Andreas Mueller)
>    2. Re: Scikit learn GridSearchCV fit method ValueError Found
>       array with 0 sample (Maciek Wójcikowski)
>
>
> ----------------------------------------------------------------------
>
> Message: 1
> Date: Fri, 8 Jul 2016 17:00:42 -0400
> From: Andreas Mueller <t3kcit at gmail.com>
> To: Scikit-learn user and developer mailing list
>         <scikit-learn at python.org>
> Subject: Re: [scikit-learn] Create a "Feature_Weight" Parameter at
>         RandomForestRegressor
> Message-ID: <5780147A.5040901 at gmail.com>
> Content-Type: text/plain; charset=windows-1252; format=flowed
>
> You would need to implement a custom splitter, I think.
>
> On 07/04/2016 04:09 PM, luizfgoncalves at dcc.ufmg.br wrote:
> > I would like to give different weights to the features in the feature set
> > for the split task of Random Forest. Right now, only the MSE metric is
> > used to select the best split, and I want to do something like feature[i]
> > = MSE[i] * feature_weight[i]. This way, I'll be able to give more
> > importance to the features I already know to be better.
> >
> > In my mind, this change would be called on the fit function, something
> > like this: def fit(self, X, y, sample_weight, feature_weight):
> > And the feature_weight would be a vector with customized weights for all
> > features present in the dataset.
> >
> > What is the best way to do that? I'm having a really hard time figuring
> > out how to make these changes in the code.
> > Thanks a lot for your attention.
> >
> > Luiz Felipe
> >
> > _______________________________________________
> > scikit-learn mailing list
> > scikit-learn at python.org
> > https://mail.python.org/mailman/listinfo/scikit-learn
>
>
>
> ------------------------------
>
> Message: 2
> Date: Fri, 8 Jul 2016 23:42:06 +0200
> From: Maciek Wójcikowski <maciek at wojcikowski.pl>
> To: Scikit-learn user and developer mailing list
>         <scikit-learn at python.org>
> Subject: Re: [scikit-learn] Scikit learn GridSearchCV fit method
>         ValueError Found array with 0 sample
> Message-ID: <CAH2JJR35CFDJPqTNFn7+uSCVKUVJEPM9mjYDwLTgkipLeWcVCw at mail.gmail.com>
> Content-Type: text/plain; charset="utf-8"
>
> Hi Michał,
>
> What are the class counts in that set? Maybe there is a problem with
> generating stratified subsamples (e.g., some classes fall below 1 sample)?
>
> ----
> Pozdrawiam,  |  Best regards,
> Maciek Wójcikowski
> maciek at wojcikowski.pl
>
> 2016-07-08 17:22 GMT+02:00 Michał Nowotka <mmmnow at gmail.com>:
>
> > Hi,
> >
> > Sorry for cross-posting
> > (http://stackoverflow.com/questions/38263933/scikit-learn-gridsearchcv-fit-method-valueerror-found-array-with-0-sample)
> > but I don't know where it is better to get help with my problem.
> > I'm working on a VM with Jupyter notebook server installed.
> > From time to time I add new notebooks and reevaluate old ones to see
> > if they still work.
> >
> > This notebook stopped working because of changes in the scikit-learn API
> > that made some parameters obsolete:
> >
> > https://github.com/chembl/mychembl/blob/master/ipython_notebooks/10_myChEMBL_machine_learning.ipynb
> >
> > I've created a corrected version of the notebook here:
> >
> > https://gist.github.com/anonymous/676c55cc501ffa48fecfcc1e1252d433
> >
> > But I'm stuck in cell 36 on this code:
> >
> > from sklearn import cross_validation
> > from sklearn.cross_validation import KFold, StratifiedKFold
> > from sklearn.grid_search import GridSearchCV
> >
> > # Hold out 95% of the data for testing.
> > X_traina, X_testa, y_traina, y_testa = cross_validation.train_test_split(
> >     x, y, test_size=0.95, random_state=23)
> >
> > params = {'min_samples_split': [8], 'max_depth': [20],
> >           'min_samples_leaf': [1], 'n_estimators': [200]}
> > cv = KFold(n=len(X_traina), n_folds=10, shuffle=True)
> > # Stratified folds try to keep the class proportions of y_traina.
> > cv_stratified = StratifiedKFold(y_traina, n_folds=5)
> > gs = GridSearchCV(custom_forest, params,
> >                   cv=cv_stratified, verbose=1, refit=True)
> > gs.fit(X_traina, y_traina)
> >
> > This gives me:
> >
> > ValueError: Found array with 0 sample(s) (shape=(0, 491)) while a
> > minimum of 1 is required.
> >
> > Now I don't understand this because when I print shapes of the samples:
> >
> > print (X_traina.shape, X_testa.shape, y_traina.shape, y_testa.shape)
> >
> > I'm getting:
> >
> > ((78, 491), (1489, 491), (78,), (1489,))
> >
> > Interestingly, if I change the test_size parameter to 0.88 (like in
> > the example corrected notebook) it works and this is the highest value
> > where it works. For this value, the shapes are:
> >
> > ((188, 491), (1379, 491), (188,), (1379,))
> >
> > So the question is - what should I change in my code to make it work
> > for test_size set to 0.95 as well?
> >
> > Kind regards,
> >
> > Michal Nowotka
>
> ------------------------------
>
> End of scikit-learn Digest, Vol 4, Issue 13
> *******************************************
>