[scikit-learn] Inconsistent Logistic Regression fit results
mail at sebastianraschka.com
Mon Aug 15 17:42:10 EDT 2016
Hi, Chris,
have you set the random seed to a specific, constant integer value? Note that the default in LogisticRegression is random_state=None. If you haven't done so yet, setting it to some arbitrary number like 123 may help.
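
As a minimal sketch of fixing the seed: the synthetic data from make_classification and the explicit solver="liblinear" below are assumptions for illustration, not taken from the code in the original message. The helper name fit_coefs is also hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Assumption: synthetic stand-in data with the same shape as the
# DataFrame in the report (240 rows, 18 columns).
X, y = make_classification(n_samples=240, n_features=18, random_state=0)


def fit_coefs(seed):
    """Fit a model with a fixed seed and return its coefficients."""
    # Per the scikit-learn docs, random_state is used by the liblinear,
    # sag and saga solvers when shuffling the data.
    clf = LogisticRegression(
        solver="liblinear",
        class_weight="balanced",
        random_state=seed,
    )
    return clf.fit(X, y).coef_


first = fit_coefs(123)
second = fit_coefs(123)

# With the same data and the same seed, the two fits should agree.
same = np.allclose(first, second)
print(same)
```

If the coefficients still differ between runs with a fixed seed, the variation is probably coming from somewhere other than the estimator's random state, for example from the input data changing between calls.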
Best,
Sebastian
> On Aug 15, 2016, at 5:27 PM, Chris Cameron <chris at upnix.com> wrote:
>
> Hi all,
>
> Using the same X and y values, sklearn.linear_model.LogisticRegression.fit() is providing me with inconsistent results.
>
> The documentation for sklearn.linear_model.LogisticRegression states that "It is thus not uncommon, to have slightly different results for the same input data." I am experiencing this; however, the suggested fix of using a smaller "tol" parameter isn't giving me a consistent fit.
>
> The code I’m using:
>
> def log_run(logreg_x, logreg_y):
>     logreg_x['pass_fail'] = logreg_y
>     df_train, df_test = train_test_split(logreg_x, random_state=0)
>     y_train = df_train.pass_fail.as_matrix()
>     y_test = df_test.pass_fail.as_matrix()
>     del(df_train['pass_fail'])
>     del(df_test['pass_fail'])
>     log_reg_fit = LogisticRegression(class_weight='balanced',tol=0.000000001).fit(df_train, y_train)
>     predicted = log_reg_fit.predict(df_test)
>     accuracy = accuracy_score(y_test, predicted)
>     kappa = cohen_kappa_score(y_test, predicted)
>
>     return [kappa, accuracy]
>
>
> I’ve gone out of my way to be sure the test and train data are the same for each run, so I don’t think there should be any random shuffling going on.
>
> Example output:
> ---
> log_run(df_save, y)
> Out[32]: [0.027777777777777728, 0.53333333333333333]
>
> log_run(df_save, y)
> Out[33]: [0.027777777777777728, 0.53333333333333333]
>
> log_run(df_save, y)
> Out[34]: [0.11347517730496456, 0.58333333333333337]
>
> log_run(df_save, y)
> Out[35]: [0.042553191489361743, 0.55000000000000004]
>
> log_run(df_save, y)
> Out[36]: [-0.07407407407407407, 0.51666666666666672]
>
> log_run(df_save, y)
> Out[37]: [0.042553191489361743, 0.55000000000000004]
>
> A little information on the problem DataFrame:
> ---
> len(df_save)
> Out[40]: 240
>
> len(df_save.columns)
> Out[41]: 18
>
>
> If I omit this particular column the Kappa no longer fluctuates:
>
> df_save['abc'].head()
> Out[42]:
> 0 0.026316
> 1 0.333333
> 2 0.015152
> 3 0.010526
> 4 0.125000
> Name: abc, dtype: float64
>
>
> Does anyone have ideas on how I can figure this out? Is there some randomness/shuffling still going on I missed?
>
>
> Thanks!
> Chris