[scikit-learn] GridsearchCV returns worse scoring the broader parameter space gets
Sebastian Raschka
mail at sebastianraschka.com
Sun Mar 31 14:57:16 EDT 2019
Hi Andreas,
the best score is determined by computing the test fold performance (here neg_mean_squared_error, since you passed a scoring argument; otherwise R^2 by default for regressors) and then averaging over the folds. Since you chose cv=10, you have 10 test folds, and the performance used for choosing the best hyperparameter setting is the average over those 10 folds.
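The averaging described above can be sketched in plain Python; the fold scores below are hypothetical, just to illustrate how the cross-validation score for one hyperparameter setting is formed:

```python
# Hypothetical per-fold test scores (neg. MSE) for one hyperparameter
# setting; cv=10 means GridSearchCV evaluates on 10 held-out folds.
fold_scores = [-0.05, -0.03, -0.04, -0.06, -0.02,
               -0.05, -0.04, -0.03, -0.05, -0.03]

# The grid-search score for this setting is the mean over the folds.
cv_score = sum(fold_scores) / len(fold_scores)
print(cv_score)  # roughly -0.04 for these made-up numbers
```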
Then, it looks like you are computing the performance manually:
> simple_tree.fit(x_tr,y_tr).score(x_tr,y_tr)
on the whole training set. Instead, I would take a look at the simple_tree.best_score_ attribute after fitting. If you compare the grids that way, the broader grid can only score equal or better, since its search space is a superset of the smaller one's.
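A minimal sketch of comparing the two numbers, using a synthetic dataset as a stand-in since the original x_tr/y_tr aren't shown:

```python
from sklearn import tree
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for x_tr/y_tr (the original data isn't shown).
x_tr, y_tr = make_regression(n_samples=200, n_features=5, noise=10.0,
                             random_state=42)

simple_tree = GridSearchCV(
    tree.DecisionTreeRegressor(random_state=42),
    param_grid={'min_samples_split': range(2, 10)},
    scoring='neg_mean_squared_error',
    cv=10,
)
simple_tree.fit(x_tr, y_tr)

# Mean cross-validated neg. MSE of the best parameter setting:
print(simple_tree.best_score_)

# By contrast, .score(x_tr, y_tr) evaluates the refit estimator on the
# *training* data, which is optimistic and not comparable across grids.
print(simple_tree.score(x_tr, y_tr))
```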
Best,
Sebastian
> On Mar 31, 2019, at 5:15 AM, Andreas Tosstorff <andt88 at hotmail.com> wrote:
>
> Dear all,
> I am new to scikit learn so please excuse my ignorance. Using GridsearchCV I am trying to optimize a DecisionTreeRegressor. The broader I make the parameter space, the worse the scoring gets.
> Setting min_samples_split to range(2,10) gives me a neg_mean_squared_error of -0.04. When setting it to range(2,5), the score is -0.004.
> simple_tree =GridSearchCV(tree.DecisionTreeRegressor(random_state=42), n_jobs=4, param_grid={'min_samples_split': range(2, 10)}, scoring='neg_mean_squared_error', cv=10, refit='neg_mean_squared_error')
>
> simple_tree.fit(x_tr,y_tr).score(x_tr,y_tr)
>
> I expect an equal or more positive score for a more extensive grid search compared to the less extensive one.
>
> I would really appreciate your help!
>
> Kind regards,
> Andreas
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn