<div dir="ltr">Hi Michał,<div><br></div><div>What are the class counts in that set? Maybe there is a problem with generating the stratified subsamples (e.g. some classes end up with fewer than 1 sample in a subsample)?</div></div><div class="gmail_extra"><br clear="all"><div><div class="gmail_signature" data-smartmail="gmail_signature">----<br>Pozdrawiam,  |  Best regards,<br>Maciek Wójcikowski<br><a href="mailto:maciek@wojcikowski.pl" target="_blank">maciek@wojcikowski.pl</a><br></div></div>
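<div>A rough sketch of how the class counts could be checked: count the samples per class and flag any class that has fewer members than the number of folds, since such a class cannot appear in every fold. The label list below is made up for illustration (in the notebook you would pass y_traina instead), the helper name is hypothetical, and whether this is the actual cause of the error has not been confirmed.</div>
<pre>
```python
from collections import Counter


def classes_too_small_for_folds(labels, n_folds):
    """Return {class_label: count} for classes with fewer samples than n_folds."""
    counts = Counter(labels)
    return {label: n for label, n in counts.items() if n_folds > n}


# Hypothetical labels standing in for y_traina (78 samples, as in the post).
y_example = [0] * 75 + [1] * 3

# Per-class sample counts.
print(Counter(y_example))

# Classes that cannot be represented in each of 5 stratified folds.
print(classes_too_small_for_folds(y_example, n_folds=5))
```
</pre>
<div>An empty result means every class has at least as many samples as there are folds; a non-empty one points at the classes worth looking at first.</div>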

<br><div class="gmail_quote">2016-07-08 17:22 GMT+02:00 Michał Nowotka <span dir="ltr">&lt;<a href="mailto:mmmnow@gmail.com" target="_blank">mmmnow@gmail.com</a>&gt;</span>:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Hi,<br>

<br>

Sorry for cross-posting<br>

(<a href="http://stackoverflow.com/questions/38263933/scikit-learn-gridsearchcv-fit-method-valueerror-found-array-with-0-sample" rel="noreferrer" target="_blank">http://stackoverflow.com/questions/38263933/scikit-learn-gridsearchcv-fit-method-valueerror-found-array-with-0-sample</a>)<br>

but I don't know which is the better place to get help with my problem.<br>

I'm working on a VM with Jupyter notebook server installed.<br>

From time to time I add new notebooks and reevaluate old ones to see<br>

if they still work.<br>

<br>

This notebook stopped working due to some changes in the scikit-learn API<br>

and some parameters becoming obsolete:<br>

<br>

<a href="https://github.com/chembl/mychembl/blob/master/ipython_notebooks/10_myChEMBL_machine_learning.ipynb" rel="noreferrer" target="_blank">https://github.com/chembl/mychembl/blob/master/ipython_notebooks/10_myChEMBL_machine_learning.ipynb</a><br>

<br>

I've created a corrected version of the notebook here:<br>

<br>

<a href="https://gist.github.com/anonymous/676c55cc501ffa48fecfcc1e1252d433" rel="noreferrer" target="_blank">https://gist.github.com/anonymous/676c55cc501ffa48fecfcc1e1252d433</a><br>

<br>

But I'm stuck in cell 36 on this code:<br>

<br>

from sklearn import cross_validation<br>

from sklearn.cross_validation import KFold, StratifiedKFold<br>

from sklearn.grid_search import GridSearchCV<br>

<br>

X_traina, X_testa, y_traina, y_testa = cross_validation.train_test_split(x, y, test_size=0.95, random_state=23)<br>

<br>

params = {'min_samples_split': [8], 'max_depth': [20],<br>

'min_samples_leaf': [1],'n_estimators':[200]}<br>

cv = KFold(n=len(X_traina),n_folds=10,shuffle=True)<br>

cv_stratified = StratifiedKFold(y_traina, n_folds=5)<br>

gs = GridSearchCV(custom_forest, params, cv=cv_stratified,verbose=1,refit=True)<br>

gs.fit(X_traina,y_traina)<br>

<br>

This gives me:<br>

<br>

ValueError: Found array with 0 sample(s) (shape=(0, 491)) while a minimum of 1 is required.<br>

<br>

I don't understand this, because when I print the shapes of the arrays:<br>

<br>

print (X_traina.shape, X_testa.shape, y_traina.shape, y_testa.shape)<br>

<br>

I'm getting:<br>

<br>

((78, 491), (1489, 491), (78,), (1489,))<br>

<br>

Interestingly, if I change the test_size parameter to 0.88 (as in<br>

the corrected example notebook) it works, and this is the highest value<br>

for which it works. For this value, the shapes are:<br>

<br>

((188, 491), (1379, 491), (188,), (1379,))<br>

<br>

So the question is - what should I change in my code to make it work<br>

for test_size set to 0.95 as well?<br>

<br>

Kind regards,<br>

<br>

Michal Nowotka<br>

_______________________________________________<br>

scikit-learn mailing list<br>

<a href="mailto:scikit-learn@python.org">scikit-learn@python.org</a><br>

<a href="https://mail.python.org/mailman/listinfo/scikit-learn" rel="noreferrer" target="_blank">https://mail.python.org/mailman/listinfo/scikit-learn</a><br>

</blockquote></div><br></div>