[scikit-learn] Inconsistent Logistic Regression fit results
Gael Varoquaux
gael.varoquaux at normalesup.org
Wed Aug 17 03:23:12 EDT 2016
In other words, you have an ill-conditioned estimation problem, and what you were seeing were numerical instabilities due to this ill-conditioning.
Not a bug. An expected behavior.
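
A minimal sketch of what ill-conditioning looks like numerically, assuming synthetic data: the two columns below (one in [0, 1], one spanning up to 1e6) are made up for illustration and are not the data from this thread. It compares the condition number of the design matrix before and after standardizing each column.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 240 rows, one small-range and one huge-range feature.
small = rng.uniform(0.0, 1.0, size=240)
huge = rng.uniform(0.0, 1e6, size=240)
X = np.column_stack([small, huge])


def standardize(X):
    """Center each column and scale it to unit variance."""
    return (X - X.mean(axis=0)) / X.std(axis=0)


# Ratio of largest to smallest singular value; large values mean that tiny
# numerical perturbations can move the fitted coefficients a lot.
cond_raw = np.linalg.cond(X)
cond_scaled = np.linalg.cond(standardize(X))

print(f"condition number, raw columns:  {cond_raw:.3g}")
print(f"condition number, standardized: {cond_scaled:.3g}")
```

The raw matrix should come out several orders of magnitude worse conditioned than the standardized one, which is the kind of situation where a solver's result can depend on small numerical details.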
Sent from my phone. Please forgive brevity and mis spelling
On Aug 16, 2016, at 18:17, Chris Cameron <chris at upnix.com> wrote:
>Thank you everyone for your help. The short version of this email is
>that changing the solver from ‘liblinear’ to ‘sag’ fixed my problem -
>but only if I upped “max_iter” to 1000.
>Longer version -
>Without max_iter=1000, I would get the warning:
>ConvergenceWarning: The max_iter was reached which means the coef_ did
>not converge
>I have some columns in my data that have a huge range of values. Using
>“liblinear”, if I transformed those columns, causing the range to be
>smaller, the results would be consistent every time.
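
One way to find and compress such wide-range columns, as a rough sketch: the frame, the column names `ratio` and `count`, and the threshold of 1000 are all hypothetical, and `np.log1p` is just one possible transform (the thread does not say which one was used).

```python
import numpy as np
import pandas as pd

# Hypothetical frame: 'ratio' stays in [0, 1], 'count' spans orders of magnitude.
df = pd.DataFrame({
    "ratio": [0.02, 0.31, 0.05, 0.44, 0.12],
    "count": [3.0, 1200.0, 45.0, 980000.0, 17.0],
})

# Range (max - min) of every column, largest first.
ranges = (df.max() - df.min()).sort_values(ascending=False)
print(ranges)

# Columns whose range exceeds an arbitrary threshold.
wide = ranges[ranges > 1000].index

# Compress the wide, non-negative columns with log(1 + x); leave the rest alone.
df_t = df.copy()
df_t[wide] = np.log1p(df_t[wide])
print(df_t)
```

`log1p` only makes sense for non-negative values; for columns that can be negative, standardizing (subtract the mean, divide by the standard deviation) is the more general option.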
>This is the function I ended up using -
>def log_run(logreg_x, logreg_y):
>    logreg_x['pass_fail'] = logreg_y
>    df_train, df_test, y_train, y_test = train_test_split(
>        logreg_x, logreg_y, random_state=0)
>    del(df_train['pass_fail'])
>    del(df_test['pass_fail'])
>    log_reg_fit = LogisticRegression(class_weight='balanced',
>                                     tol=0.00000001,
>                                     random_state=8,
>                                     solver='sag',
>                                     max_iter=1000).fit(df_train.values, y_train)
>    predicted = log_reg_fit.predict(df_test.values)
>    accuracy = accuracy_score(y_test, predicted)
>    kappa = cohen_kappa_score(y_test, predicted)
>    return [kappa, accuracy]
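
A variant of the function above that standardizes the features inside a pipeline before fitting, as a sketch only: `log_run_scaled` is a hypothetical name, the data comes from `make_classification` rather than from this thread, and one column is inflated by 1e6 to imitate a huge-range feature. Scaling is a common companion to the 'sag' solver, since its convergence depends on features being on similar scales.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, cohen_kappa_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def log_run_scaled(X, y):
    """Split, standardize, fit, and return [kappa, accuracy] on the test split."""
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
    model = make_pipeline(
        StandardScaler(),  # fitted on the training split only
        LogisticRegression(class_weight="balanced", solver="sag",
                           random_state=8, max_iter=1000),
    )
    model.fit(X_train, y_train)
    predicted = model.predict(X_test)
    return [cohen_kappa_score(y_test, predicted),
            accuracy_score(y_test, predicted)]


# Synthetic stand-in for the 240 x 18 frame described in the thread.
X, y = make_classification(n_samples=240, n_features=18, random_state=0)
X[:, 0] *= 1e6  # imitate one column with a huge range

first = log_run_scaled(X, y)
second = log_run_scaled(X, y)
print(first, second)
```

Putting the scaler in the pipeline keeps the test split out of the scaling statistics, and the two calls give a quick check that repeated runs agree.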
>Thank you again for the help,
>> On Aug 15, 2016, at 4:26 PM, mail at sebastianraschka.com wrote:
>> hm, was worth a try. What happens if you change the solver to something other than liblinear? Does the issue still persist?
>> Btw. scikit-learn works with NumPy arrays, not NumPy matrices. Probably unrelated to your issue, but I’d recommend setting
>>> y_train = df_train.pass_fail.values
>>> y_test = df_test.pass_fail.values
>> instead of
>>> y_train = df_train.pass_fail.as_matrix()
>>> y_test = df_test.pass_fail.as_matrix()
>> Also, try passing NumPy arrays to the fit method:
>>> log_reg_fit = LogisticRegression(...).fit(df_train.values,
>> and
>>> predicted = log_reg_fit.predict(df_test.values)
>> and so forth.
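
For readers on current pandas, a short sketch of the same advice: `as_matrix()` was removed in pandas 1.0, `.values` still works, and `.to_numpy()` is the documented way to get a plain NumPy array. The tiny frame and the column `x1` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical training frame with the label stored in 'pass_fail'.
df_train = pd.DataFrame({"x1": [0.1, 0.4, 0.7], "pass_fail": [0, 1, 1]})

# Labels as a 1-D ndarray, features as a 2-D ndarray (not np.matrix).
y_train = df_train["pass_fail"].to_numpy()
X_train = df_train.drop(columns="pass_fail").to_numpy()

print(type(X_train), X_train.shape)
print(type(y_train), y_train.shape)
```

Using `drop(columns=...)` returns a new frame, so the original `df_train` keeps its label column, unlike the `del` calls in the functions quoted in this thread.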
>>> On Aug 15, 2016, at 6:00 PM, Chris Cameron <chris at upnix.com> wrote:
>>> Sebastian,
>>> That doesn’t do it. With the function:
>>> def log_run(logreg_x, logreg_y):
>>>     logreg_x['pass_fail'] = logreg_y
>>>     df_train, df_test = train_test_split(logreg_x, random_state=0)
>>>     y_train = df_train.pass_fail.as_matrix()
>>>     y_test = df_test.pass_fail.as_matrix()
>>>     del(df_train['pass_fail'])
>>>     del(df_test['pass_fail'])
>>>     log_reg_fit = LogisticRegression(class_weight='balanced',
>>>                                      tol=0.000000001,
>>>                                      random_state=0).fit(df_train,
>>>     predicted = log_reg_fit.predict(df_test)
>>>     accuracy = accuracy_score(y_test, predicted)
>>>     kappa = cohen_kappa_score(y_test, predicted)
>>>     return [kappa, accuracy]
>>> I’m still seeing:
>>> log_run(df_save, y)
>>> Out[7]: [-0.054421768707483005, 0.48333333333333334]
>>> log_run(df_save, y)
>>> Out[8]: [0.042553191489361743, 0.55000000000000004]
>>> log_run(df_save, y)
>>> Out[9]: [0.042553191489361743, 0.55000000000000004]
>>> log_run(df_save, y)
>>> Out[10]: [0.027777777777777728, 0.53333333333333333]
>>> Chris
>>>> On Aug 15, 2016, at 3:42 PM, mail at sebastianraschka.com wrote:
>>>> Hi, Chris,
>>>> have you set the random seed to a specific, constant integer value? Note that the default in LogisticRegression is random_state=None. Setting it to some arbitrary number like 123 may help if you haven’t done so yet.
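
A small sketch of what fixing the seed buys you, assuming synthetic data from `make_classification` and the 'sag' solver (`random_state` only matters for solvers that shuffle the data, such as 'sag', 'saga', and 'liblinear'). `fit_coef` is a hypothetical helper.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data; not the data from this thread.
X, y = make_classification(n_samples=240, n_features=18, random_state=0)


def fit_coef(seed):
    """Fit with a given random_state and return the learned coefficients."""
    clf = LogisticRegression(solver="sag", random_state=seed, max_iter=1000)
    return clf.fit(X, y).coef_


# Same seed twice: the shuffling order is identical, so the fits should match.
same_a = fit_coef(123)
same_b = fit_coef(123)
print(np.array_equal(same_a, same_b))
```

A fixed seed removes the shuffling as a source of variation; it does not help if the differences come from a badly conditioned problem or a solver that has not converged.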
>>>> Best,
>>>> Sebastian
>>>>> On Aug 15, 2016, at 5:27 PM, Chris Cameron <chris at upnix.com>
>>>>> Hi all,
>>>>> Using the same X and y values
>sklearn.linear_model.LogisticRegression.fit() is providing me with
>inconsistent results.
>>>>> The documentation for sklearn.linear_model.LogisticRegression states that "It is thus not uncommon, to have slightly different results for the same input data." I am experiencing this; however, the suggested fix of using a smaller "tol" parameter isn't giving me a consistent fit.
>>>>> The code I’m using:
>>>>> def log_run(logreg_x, logreg_y):
>>>>>     logreg_x['pass_fail'] = logreg_y
>>>>>     df_train, df_test = train_test_split(logreg_x, random_state=0)
>>>>>     y_train = df_train.pass_fail.as_matrix()
>>>>>     y_test = df_test.pass_fail.as_matrix()
>>>>>     del(df_train['pass_fail'])
>>>>>     del(df_test['pass_fail'])
>>>>>     log_reg_fit =
>>>>>     predicted = log_reg_fit.predict(df_test)
>>>>>     accuracy = accuracy_score(y_test, predicted)
>>>>>     kappa = cohen_kappa_score(y_test, predicted)
>>>>>     return [kappa, accuracy]
>>>>> I’ve gone out of my way to be sure the test and train data is the
>same for each run, so I don’t think there should be random shuffling
>going on.
>>>>> Example output:
>>>>> ---
>>>>> log_run(df_save, y)
>>>>> Out[32]: [0.027777777777777728, 0.53333333333333333]
>>>>> log_run(df_save, y)
>>>>> Out[33]: [0.027777777777777728, 0.53333333333333333]
>>>>> log_run(df_save, y)
>>>>> Out[34]: [0.11347517730496456, 0.58333333333333337]
>>>>> log_run(df_save, y)
>>>>> Out[35]: [0.042553191489361743, 0.55000000000000004]
>>>>> log_run(df_save, y)
>>>>> Out[36]: [-0.07407407407407407, 0.51666666666666672]
>>>>> log_run(df_save, y)
>>>>> Out[37]: [0.042553191489361743, 0.55000000000000004]
>>>>> A little information on the problem DataFrame:
>>>>> ---
>>>>> len(df_save)
>>>>> Out[40]: 240
>>>>> len(df_save.columns)
>>>>> Out[41]: 18
>>>>> If I omit this particular column the Kappa no longer fluctuates:
>>>>> df_save['abc'].head()
>>>>> Out[42]:
>>>>> 0 0.026316
>>>>> 1 0.333333
>>>>> 2 0.015152
>>>>> 3 0.010526
>>>>> 4 0.125000
>>>>> Name: abc, dtype: float64
>>>>> Does anyone have ideas on how I can figure this out? Is there some
>randomness/shuffling still going on I missed?
>>>>> Thanks!
>>>>> Chris
>>>>> _______________________________________________
>>>>> scikit-learn mailing list
>>>>> scikit-learn at python.org
>>>>> https://mail.python.org/mailman/listinfo/scikit-learn