[scikit-learn] Inconsistent Logistic Regression fit results
Chris Cameron
chris at upnix.com
Tue Aug 16 12:15:38 EDT 2016
Thank you everyone for your help. The short version of this email is that changing the solver from ‘liblinear’ to ‘sag’ fixed my problem - but only if I upped “max_iter” to 1000.
Longer version -
Without max_iter=1000, I would get the warning:
ConvergenceWarning: The max_iter was reached which means the coef_ did not converge
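For anyone who wants to check programmatically whether a fit stopped at the iteration cap, one option is to record the warning and compare n_iter_ with max_iter. This is only a minimal sketch: the synthetic data from make_classification stands in for the real DataFrame, and fit_and_report is a hypothetical helper name, not something from this thread.

```python
import warnings

from sklearn.datasets import make_classification
from sklearn.exceptions import ConvergenceWarning
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (assumption, not the data discussed in this thread).
X, y = make_classification(n_samples=240, n_features=17, random_state=0)


def fit_and_report(max_iter):
    """Fit with the 'sag' solver and return (iterations used, warning seen)."""
    model = LogisticRegression(solver="sag", random_state=8, max_iter=max_iter)
    with warnings.catch_warnings(record=True) as caught:
        # Record every warning instead of printing it.
        warnings.simplefilter("always")
        model.fit(X, y)
    warned = any(issubclass(w.category, ConvergenceWarning) for w in caught)
    # n_iter_ is an array; its maximum is the number of passes actually run.
    return int(model.n_iter_.max()), warned


for cap in (1, 1000):
    n_iter, warned = fit_and_report(cap)
    print(f"max_iter={cap}: n_iter_={n_iter}, ConvergenceWarning={warned}")
```

If n_iter_ equals max_iter, the solver ran out of iterations, which is the cue to raise max_iter or rescale the features.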
I have some columns in my data that have a huge range of values. With “liblinear”, if I transformed those columns so that their range was smaller, the results were consistent every time.
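To illustrate the kind of transformation described above, here is a minimal sketch (not the code from this thread) that standardizes every column before fitting. The synthetic data, the artificially inflated first column, and the helper name fit_and_predict are all assumptions for illustration. The scikit-learn documentation notes that fast convergence of the 'sag' solver is only guaranteed when features have approximately the same scale, so scaling is relevant for that solver as well.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real data (assumption): 240 rows, 17 features.
X, y = make_classification(n_samples=240, n_features=17, random_state=0)
X[:, 0] *= 1e6  # simulate one column with a very wide range of values


def fit_and_predict(X, y):
    """Standardize the features, fit logistic regression, predict on X."""
    model = make_pipeline(
        StandardScaler(),  # zero mean, unit variance for every column
        LogisticRegression(class_weight="balanced",
                           solver="sag",
                           random_state=8,
                           max_iter=1000),
    )
    model.fit(X, y)
    return model.predict(X)


first = fit_and_predict(X, y)
second = fit_and_predict(X, y)
print("identical predictions:", np.array_equal(first, second))
```

Putting the scaler inside the pipeline means the same scaling learned from the training data is reused at prediction time.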
This is the function I ended up using -
def log_run(logreg_x, logreg_y):
    logreg_x['pass_fail'] = logreg_y
    df_train, df_test, y_train, y_test = train_test_split(logreg_x, logreg_y, random_state=0)
    del(df_train['pass_fail'])
    del(df_test['pass_fail'])
    log_reg_fit = LogisticRegression(class_weight='balanced',
                                     tol=0.00000001,
                                     random_state=8,
                                     solver='sag',
                                     max_iter=1000).fit(df_train.values, y_train)
    predicted = log_reg_fit.predict(df_test.values)
    accuracy = accuracy_score(y_test, predicted)
    kappa = cohen_kappa_score(y_test, predicted)
    return [kappa, accuracy]
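As a side note on how large the run-to-run differences in the quoted messages below really are, the reported accuracies can be converted back into counts of correct predictions. This is a small arithmetic sketch, assuming train_test_split used its default test_size of 0.25, which gives 60 test rows out of the 240 reported by len(df_save).

```python
# Accuracies copied from the runs quoted in this thread.
accuracies = [
    0.48333333333333334,
    0.51666666666666672,
    0.53333333333333333,
    0.55000000000000004,
    0.58333333333333337,
]

n_rows = 240          # len(df_save), as reported in the thread
test_fraction = 0.25  # assumption: default test_size of train_test_split
n_test = int(n_rows * test_fraction)

# Each accuracy is (number of correct predictions) / n_test.
correct_counts = [round(acc * n_test) for acc in accuracies]

for acc, count in zip(accuracies, correct_counts):
    print(f"accuracy {acc:.4f} -> {count} of {n_test} correct")
```

Under that assumption, the whole spread between runs corresponds to only a handful of test predictions changing.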
Thank you again for the help,
Chris
> On Aug 15, 2016, at 4:26 PM, mail at sebastianraschka.com wrote:
>
> Hm, it was worth a try. What happens if you change the solver to something other than liblinear? Does the issue still persist?
>
>
> Btw., scikit-learn works with NumPy arrays, not NumPy matrices. This is probably unrelated to your issue, but I’d recommend setting
>
>> y_train = df_train.pass_fail.values
>> y_test = df_test.pass_fail.values
>
> instead of
>
>> y_train = df_train.pass_fail.as_matrix()
>> y_test = df_test.pass_fail.as_matrix()
>
>
> Also, try passing NumPy arrays to the fit method:
>
>> log_reg_fit = LogisticRegression(...).fit(df_train.values, y_train)
>
> and
>
>> predicted = log_reg_fit.predict(df_test.values)
>
> and so forth.
>
>
>
>
>
>> On Aug 15, 2016, at 6:00 PM, Chris Cameron <chris at upnix.com> wrote:
>>
>> Sebastian,
>>
>> That doesn’t do it. With the function:
>>
>> def log_run(logreg_x, logreg_y):
>>     logreg_x['pass_fail'] = logreg_y
>>     df_train, df_test = train_test_split(logreg_x, random_state=0)
>>     y_train = df_train.pass_fail.as_matrix()
>>     y_test = df_test.pass_fail.as_matrix()
>>     del(df_train['pass_fail'])
>>     del(df_test['pass_fail'])
>>     log_reg_fit = LogisticRegression(class_weight='balanced',
>>                                      tol=0.000000001,
>>                                      random_state=0).fit(df_train, y_train)
>>     predicted = log_reg_fit.predict(df_test)
>>     accuracy = accuracy_score(y_test, predicted)
>>     kappa = cohen_kappa_score(y_test, predicted)
>>
>>     return [kappa, accuracy]
>>
>> I’m still seeing:
>> log_run(df_save, y)
>> Out[7]: [-0.054421768707483005, 0.48333333333333334]
>>
>> log_run(df_save, y)
>> Out[8]: [0.042553191489361743, 0.55000000000000004]
>>
>> log_run(df_save, y)
>> Out[9]: [0.042553191489361743, 0.55000000000000004]
>>
>> log_run(df_save, y)
>> Out[10]: [0.027777777777777728, 0.53333333333333333]
>>
>>
>> Chris
>>
>>> On Aug 15, 2016, at 3:42 PM, mail at sebastianraschka.com wrote:
>>>
>>> Hi, Chris,
>>> have you set the random seed to a specific, constant integer value? Note that the default in LogisticRegression is random_state=None. Setting it to some arbitrary number like 123 may help if you haven’t done so yet.
>>>
>>> Best,
>>> Sebastian
>>>
>>>
>>>
>>>> On Aug 15, 2016, at 5:27 PM, Chris Cameron <chris at upnix.com> wrote:
>>>>
>>>> Hi all,
>>>>
>>>> Using the same X and y values, sklearn.linear_model.LogisticRegression.fit() is giving me inconsistent results.
>>>>
>>>> The documentation for sklearn.linear_model.LogisticRegression states that “It is thus not uncommon, to have slightly different results for the same input data.” I am experiencing this; however, the suggested fix of using a smaller “tol” parameter isn’t giving me a consistent fit.
>>>>
>>>> The code I’m using:
>>>>
>>>> def log_run(logreg_x, logreg_y):
>>>>     logreg_x['pass_fail'] = logreg_y
>>>>     df_train, df_test = train_test_split(logreg_x, random_state=0)
>>>>     y_train = df_train.pass_fail.as_matrix()
>>>>     y_test = df_test.pass_fail.as_matrix()
>>>>     del(df_train['pass_fail'])
>>>>     del(df_test['pass_fail'])
>>>>     log_reg_fit = LogisticRegression(class_weight='balanced', tol=0.000000001).fit(df_train, y_train)
>>>>     predicted = log_reg_fit.predict(df_test)
>>>>     accuracy = accuracy_score(y_test, predicted)
>>>>     kappa = cohen_kappa_score(y_test, predicted)
>>>>
>>>>     return [kappa, accuracy]
>>>>
>>>>
>>>> I’ve gone out of my way to be sure the test and train data is the same for each run, so I don’t think there should be random shuffling going on.
>>>>
>>>> Example output:
>>>> ---
>>>> log_run(df_save, y)
>>>> Out[32]: [0.027777777777777728, 0.53333333333333333]
>>>>
>>>> log_run(df_save, y)
>>>> Out[33]: [0.027777777777777728, 0.53333333333333333]
>>>>
>>>> log_run(df_save, y)
>>>> Out[34]: [0.11347517730496456, 0.58333333333333337]
>>>>
>>>> log_run(df_save, y)
>>>> Out[35]: [0.042553191489361743, 0.55000000000000004]
>>>>
>>>> log_run(df_save, y)
>>>> Out[36]: [-0.07407407407407407, 0.51666666666666672]
>>>>
>>>> log_run(df_save, y)
>>>> Out[37]: [0.042553191489361743, 0.55000000000000004]
>>>>
>>>> A little information on the problem DataFrame:
>>>> ---
>>>> len(df_save)
>>>> Out[40]: 240
>>>>
>>>> len(df_save.columns)
>>>> Out[41]: 18
>>>>
>>>>
>>>> If I omit this particular column the Kappa no longer fluctuates:
>>>>
>>>> df_save['abc'].head()
>>>> Out[42]:
>>>> 0 0.026316
>>>> 1 0.333333
>>>> 2 0.015152
>>>> 3 0.010526
>>>> 4 0.125000
>>>> Name: abc, dtype: float64
>>>>
>>>>
>>>> Does anyone have ideas on how I can figure this out? Is there some randomness/shuffling still going on I missed?
>>>>
>>>>
>>>> Thanks!
>>>> Chris
>>>> _______________________________________________
>>>> scikit-learn mailing list
>>>> scikit-learn at python.org
>>>> https://mail.python.org/mailman/listinfo/scikit-learn