[scikit-learn] Bm25 pull request
Basil Beirouti
basilbeirouti at gmail.com
Mon Jul 11 18:11:18 EDT 2016
Hi,
Joel, thanks for pointing out the indentation issue. I have fixed it.
Can someone explain what the three tests that ran automatically on my code are, and why the AppVeyor and Travis ones failed?
Sincerely,
Basil Beirouti
> On Jul 11, 2016, at 11:00 AM, scikit-learn-request at python.org wrote:
>
> Today's Topics:
>
> 1. Re: Scikit learn GridSearchCV fit method ValueError Found
> array with 0 sample (Maciek Wójcikowski)
>
>
> ----------------------------------------------------------------------
>
> Message: 1
> Date: Mon, 11 Jul 2016 13:33:28 +0200
> From: Maciek Wójcikowski <maciek at wojcikowski.pl>
> To: Scikit-learn user and developer mailing list
> <scikit-learn at python.org>
> Subject: Re: [scikit-learn] Scikit learn GridSearchCV fit method
> ValueError Found array with 0 sample
> Message-ID:
> <CAH2JJR1BqHC0PzNv7uaugkQ9GDBUTev4yuJ1qOWuJa=eWZ1wnQ at mail.gmail.com>
> Content-Type: text/plain; charset="utf-8"
>
> Shouldn't you pass labels (binary) instead of continuous data? If you wish
> to stick to logK's and keep the distribution unchanged then you'd better
> reduce the number of classes (e.g. round the values to the nearest integer?).
>
> It might be the case that the counts per class are floored and you get 0
> for some cases.
>
> ----
> Pozdrawiam, | Best regards,
> Maciek Wójcikowski
> maciek at wojcikowski.pl
>
> 2016-07-11 13:16 GMT+02:00 Michał Nowotka <mmmnow at gmail.com>:
>
>> Hi Maciek,
>>
>> Thanks for the suggestion. I think the problem is indeed related to
>> StratifiedKFold, because if I use KFold instead the code works fine.
>> However, if I print the StratifiedKFold object it looks fine to me:
>>
>> sklearn.cross_validation.StratifiedKFold(labels=[ 5.43 8.74 8.1
>> 6.55 7.66 6.52 8.6 7.1 6.4 8.05 7.89 6.68
>> 8.06 6.17 5.5 7.96 5.78 6. 7.74 5.83 6.51 6.31 6.68 9.22
>> 6.07 7.06 7.12 8.64 5.72 6.4 7.64 5.74 7.41 6.49 6.81 7.1
>> 7.66 6.68 7.05 6.28 5.49 6.35 6.9 6.2 7.51 5.65 9.3 5.84
>> 6.92 5.75 6.92 8.8 7.04 5.81 5.73 5.31 7.13 7.66 6.98 5.93
>> 8.24 6.96 8.22 7.27 7.34 5.91 5.57 6.5 7.28 6.74 4.92 6.88
>> 5.8 9.15 6.63 6.37 8.66 6.4 ], n_folds=5, shuffle=False,
>> random_state=None)
>>
>>
>> On Fri, Jul 8, 2016 at 10:42 PM, Maciek Wójcikowski
>> <maciek at wojcikowski.pl> wrote:
>>> Hi Michał,
>>>
>>> What are the class counts in that set? Maybe there is a problem with
>>> generating stratified subsamples (e.g. some classes get below 1 sample)?
>>>
>>> ----
>>> Pozdrawiam, | Best regards,
>>> Maciek Wójcikowski
>>> maciek at wojcikowski.pl
>>>
>>> 2016-07-08 17:22 GMT+02:00 Michał Nowotka <mmmnow at gmail.com>:
>>>>
>>>> Hi,
>>>>
>>>> Sorry for cross-posting
>>>> (http://stackoverflow.com/questions/38263933/scikit-learn-gridsearchcv-fit-method-valueerror-found-array-with-0-sample)
>>>> but I don't know where it is better to get help with my problem.
>>>> I'm working on a VM with Jupyter notebook server installed.
>>>> From time to time I add new notebooks and reevaluate old ones to see
>>>> if they still work.
>>>>
>>>> This notebook stopped working because of changes in the scikit-learn
>>>> API that made some parameters obsolete:
>>>> https://github.com/chembl/mychembl/blob/master/ipython_notebooks/10_myChEMBL_machine_learning.ipynb
>>>>
>>>> I've created a corrected version of the notebook here:
>>>>
>>>> https://gist.github.com/anonymous/676c55cc501ffa48fecfcc1e1252d433
>>>>
>>>> But I'm stuck in cell 36 on this code:
>>>>
>>>> from sklearn import cross_validation
>>>> from sklearn.cross_validation import KFold, StratifiedKFold
>>>> from sklearn.grid_search import GridSearchCV
>>>>
>>>> # x, y and custom_forest are assumed to be defined in earlier cells
>>>> X_traina, X_testa, y_traina, y_testa = cross_validation.train_test_split(
>>>>     x, y, test_size=0.95, random_state=23)
>>>>
>>>> params = {'min_samples_split': [8], 'max_depth': [20],
>>>>           'min_samples_leaf': [1], 'n_estimators': [200]}
>>>> cv = KFold(n=len(X_traina), n_folds=10, shuffle=True)
>>>> cv_stratified = StratifiedKFold(y_traina, n_folds=5)
>>>> gs = GridSearchCV(custom_forest, params, cv=cv_stratified,
>>>>                   verbose=1, refit=True)
>>>> gs.fit(X_traina, y_traina)  # this call raises the ValueError
>>>>
>>>> This gives me:
>>>>
>>>> ValueError: Found array with 0 sample(s) (shape=(0, 491)) while a
>>>> minimum of 1 is required.
>>>>
>>>> Now I don't understand this because when I print shapes of the samples:
>>>>
>>>> print (X_traina.shape, X_testa.shape, y_traina.shape, y_testa.shape)
>>>>
>>>> I'm getting:
>>>>
>>>> ((78, 491), (1489, 491), (78,), (1489,))
>>>>
>>>> Interestingly, if I change the test_size parameter to 0.88 (like in
>>>> the example corrected notebook) it works and this is the highest value
>>>> where it works. For this value, the shapes are:
>>>>
>>>> ((188, 491), (1379, 491), (188,), (1379,))
>>>>
>>>> So the question is - what should I change in my code to make it work
>>>> for test_size set to 0.95 as well?
>>>>
>>>> Kind regards,
>>>>
>>>> Michal Nowotka
>>>> _______________________________________________
>>>> scikit-learn mailing list
>>>> scikit-learn at python.org
>>>> https://mail.python.org/mailman/listinfo/scikit-learn
> End of scikit-learn Digest, Vol 4, Issue 15
> *******************************************
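The class-count question raised in the quoted thread can be checked directly from the labels printed in the StratifiedKFold output. The following is a minimal sketch using only the Python standard library. It does not call scikit-learn, and the helper names `class_counts` and `small_classes` are hypothetical, introduced here for illustration. It counts how many samples share each label, first for the raw continuous values and then after rounding to the nearest integer, and compares the counts with n_folds=5. Note that Python's built-in `round` rounds halves to the nearest even integer.

```python
from collections import Counter

# The 78 training labels shown in the quoted StratifiedKFold repr.
labels = [
    5.43, 8.74, 8.1, 6.55, 7.66, 6.52, 8.6, 7.1, 6.4, 8.05, 7.89, 6.68,
    8.06, 6.17, 5.5, 7.96, 5.78, 6.0, 7.74, 5.83, 6.51, 6.31, 6.68, 9.22,
    6.07, 7.06, 7.12, 8.64, 5.72, 6.4, 7.64, 5.74, 7.41, 6.49, 6.81, 7.1,
    7.66, 6.68, 7.05, 6.28, 5.49, 6.35, 6.9, 6.2, 7.51, 5.65, 9.3, 5.84,
    6.92, 5.75, 6.92, 8.8, 7.04, 5.81, 5.73, 5.31, 7.13, 7.66, 6.98, 5.93,
    8.24, 6.96, 8.22, 7.27, 7.34, 5.91, 5.57, 6.5, 7.28, 6.74, 4.92, 6.88,
    5.8, 9.15, 6.63, 6.37, 8.66, 6.4,
]
n_folds = 5  # value passed to StratifiedKFold in the quoted code


def class_counts(values):
    """Return a Counter mapping each distinct label to its sample count."""
    return Counter(values)


def small_classes(counts, n_folds):
    """Return the labels whose class has fewer members than there are folds."""
    return sorted(label for label, n in counts.items() if n < n_folds)


# Treating every distinct float as its own class, as happens when
# continuous targets are passed as stratification labels.
raw_counts = class_counts(labels)

# Coarser classes obtained by rounding each value to the nearest integer.
rounded_counts = class_counts(round(v) for v in labels)

print("samples:", len(labels))
print("distinct raw labels:", len(raw_counts))
print("largest raw class:", max(raw_counts.values()))
print("rounded class counts:", dict(sorted(rounded_counts.items())))
print("rounded classes smaller than n_folds:",
      small_classes(rounded_counts, n_folds))
```

If the last line prints a non-empty list, rounding alone still leaves classes with fewer members than folds, so a coarser binning or plain KFold may be worth trying instead.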