<div dir="ltr">Hi Michał,<div><br></div><div>What are the class counts in that set? Maybe there is a problem with generating stratified subsamples (e.g. some classes end up with fewer than one sample)?</div></div><div class="gmail_extra"><br clear="all"><div><div class="gmail_signature" data-smartmail="gmail_signature">----<br>Pozdrawiam,  |  Best regards,<br>Maciek Wójcikowski<br><a href="mailto:maciek@wojcikowski.pl" target="_blank">maciek@wojcikowski.pl</a><br></div></div>
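<div>The class-count check suggested above can be sketched with the standard library alone. This is a minimal sketch, not scikit-learn's own validation: the helper name <code>classes_too_small_for_folds</code> and the label list <code>y_example</code> are hypothetical, since the thread does not give the actual class counts of <code>y_traina</code>. It lists the classes that have fewer samples than the number of folds and therefore cannot appear in every stratified fold.</div>
<pre>
```python
from collections import Counter


def classes_too_small_for_folds(labels, n_folds):
    """Return {label: count} for classes with fewer samples than n_folds."""
    counts = Counter(labels)
    # A class needs at least n_folds members to be present in every fold.
    return {label: n for label, n in counts.items() if n_folds > n}


# Hypothetical, heavily imbalanced labels standing in for y_traina (78 samples).
y_example = [0] * 75 + [1] * 3

print(Counter(y_example))
print(classes_too_small_for_folds(y_example, n_folds=5))
```
</pre>
<div>If the returned dictionary is non-empty for the real labels, using fewer folds or a larger training split would be the first things to try. Whether this is what triggers the ValueError reported below is not confirmed.</div>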
<br><div class="gmail_quote">2016-07-08 17:22 GMT+02:00 Michał Nowotka <span dir="ltr"><<a href="mailto:mmmnow@gmail.com" target="_blank">mmmnow@gmail.com</a>></span>:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Hi,<br>
<br>
Sorry for cross posting<br>
(<a href="http://stackoverflow.com/questions/38263933/scikit-learn-gridsearchcv-fit-method-valueerror-found-array-with-0-sample" rel="noreferrer" target="_blank">http://stackoverflow.com/questions/38263933/scikit-learn-gridsearchcv-fit-method-valueerror-found-array-with-0-sample</a>)<br>
but I don't know where it is better to get help with my problem.<br>
I'm working on a VM with Jupyter notebook server installed.<br>
From time to time I add new notebooks and reevaluate old ones to see<br>
if they still work.<br>
<br>
This notebook stopped working due to some changes in the scikit-learn API,<br>
and some parameters became obsolete:<br>
<br>
<a href="https://github.com/chembl/mychembl/blob/master/ipython_notebooks/10_myChEMBL_machine_learning.ipynb" rel="noreferrer" target="_blank">https://github.com/chembl/mychembl/blob/master/ipython_notebooks/10_myChEMBL_machine_learning.ipynb</a><br>
<br>
I've created a corrected version of the notebook here:<br>
<br>
<a href="https://gist.github.com/anonymous/676c55cc501ffa48fecfcc1e1252d433" rel="noreferrer" target="_blank">https://gist.github.com/anonymous/676c55cc501ffa48fecfcc1e1252d433</a><br>
<br>
But I'm stuck in cell 36 on this code:<br>
<br>
from sklearn import cross_validation<br>
from sklearn.cross_validation import KFold, StratifiedKFold<br>
from sklearn.grid_search import GridSearchCV<br>
<br>
X_traina, X_testa, y_traina, y_testa = cross_validation.train_test_split(<br>
x, y, test_size=0.95, random_state=23)<br>
<br>
params = {'min_samples_split': [8], 'max_depth': [20],<br>
'min_samples_leaf': [1],'n_estimators':[200]}<br>
cv = KFold(n=len(X_traina),n_folds=10,shuffle=True)<br>
cv_stratified = StratifiedKFold(y_traina, n_folds=5)<br>
gs = GridSearchCV(custom_forest, params, cv=cv_stratified,verbose=1,refit=True)<br>
gs.fit(X_traina,y_traina)<br>
<br>
This gives me:<br>
<br>
ValueError: Found array with 0 sample(s) (shape=(0, 491)) while a<br>
minimum of 1 is required.<br>
<br>
Now I don't understand this because when I print shapes of the samples:<br>
<br>
print (X_traina.shape, X_testa.shape, y_traina.shape, y_testa.shape)<br>
<br>
I'm getting:<br>
<br>
((78, 491), (1489, 491), (78,), (1489,))<br>
<br>
Interestingly, if I change the test_size parameter to 0.88 (like in<br>
the example corrected notebook) it works and this is the highest value<br>
where it works. For this value, the shapes are:<br>
<br>
((188, 491), (1379, 491), (188,), (1379,))<br>
<br>
So the question is - what should I change in my code to make it work<br>
for test_size set to 0.95 as well?<br>
<br>
Kind regards,<br>
<br>
Michal Nowotka<br>
_______________________________________________<br>
scikit-learn mailing list<br>
<a href="mailto:scikit-learn@python.org">scikit-learn@python.org</a><br>
<a href="https://mail.python.org/mailman/listinfo/scikit-learn" rel="noreferrer" target="_blank">https://mail.python.org/mailman/listinfo/scikit-learn</a><br>
</blockquote></div><br></div>