[scikit-learn] Inconsistent Logistic Regression fit results
Gael Varoquaux
gael.varoquaux at normalesup.org
Wed Aug 17 03:23:12 EDT 2016
In other words, you have an ill-conditioned estimation problem, and what you were seeing were numerical instabilities due to this ill-conditioning.
Not a bug. An expected behavior.
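
To make "ill-conditioned" concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the code from this thread: the two columns are made-up data (one small ratio-like feature, one feature with a huge range), the helper names `gram_condition_number` and `standardize` are hypothetical, and the condition number of the 2x2 Gram matrix X^T X is used only as a rough proxy for how hard the logistic regression fit is for an iterative solver.

```python
import math
import random
import statistics


def gram_condition_number(col_a, col_b):
    """Condition number of the 2x2 Gram matrix X^T X for columns a and b."""
    saa = sum(a * a for a in col_a)
    sbb = sum(b * b for b in col_b)
    sab = sum(a * b for a, b in zip(col_a, col_b))
    # Largest eigenvalue of the symmetric matrix [[saa, sab], [sab, sbb]].
    lam_max = (saa + sbb) / 2 + math.hypot((saa - sbb) / 2, sab)
    # Smallest eigenvalue from det = lam_max * lam_min, which is more
    # stable than computing (mean - radius) directly.
    lam_min = (saa * sbb - sab * sab) / lam_max
    return lam_max / lam_min


def standardize(col):
    """Shift to zero mean and scale to unit (population) standard deviation."""
    mean = statistics.mean(col)
    std = statistics.pstdev(col)
    return [(v - mean) / std for v in col]


# Hypothetical data: 240 rows, one small ratio-like column and one
# column with values up to a million.
rng = random.Random(0)
small = [rng.uniform(0.01, 0.35) for _ in range(240)]
huge = [rng.uniform(1.0, 1e6) for _ in range(240)]

cond_raw = gram_condition_number(small, huge)
cond_scaled = gram_condition_number(standardize(small), standardize(huge))

print(f"raw:    {cond_raw:.3g}")
print(f"scaled: {cond_scaled:.3g}")
```

The raw condition number comes out many orders of magnitude larger than the standardized one. In scikit-learn, the same rescaling can be done with `sklearn.preprocessing.StandardScaler` before fitting.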
Sent from my phone. Please forgive brevity and misspelling.
On Aug 16, 2016, at 18:17, Chris Cameron <chris at upnix.com> wrote:
>Thank you everyone for your help. The short version of this email is
>that changing the solver from ‘liblinear’ to ‘sag’ fixed my problem -
>but only if I upped “max_iter” to 1000.
>
>
>Longer version -
>Without max_iter=1000, I would get the warning:
>ConvergenceWarning: The max_iter was reached which means the coef_ did
>not converge
>
>I have some columns in my data that have a huge range of values. Using
>“liblinear”, if I transformed those columns, causing the range to be
>smaller, the results would be consistent every time.
>
>This is the function I ended up using -
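>from sklearn.linear_model import LogisticRegression
>from sklearn.metrics import accuracy_score, cohen_kappa_score
>from sklearn.model_selection import train_test_split
>
>def log_run(logreg_x, logreg_y):
>    # Note: this adds a column to the caller's DataFrame in place.
>    logreg_x['pass_fail'] = logreg_y
>    # A fixed random_state gives the same train/test split on every call.
>    df_train, df_test, y_train, y_test = train_test_split(
>        logreg_x, logreg_y, random_state=0)
>    # Drop the label column so it is not used as a feature.
>    del df_train['pass_fail']
>    del df_test['pass_fail']
>    log_reg_fit = LogisticRegression(class_weight='balanced',
>                                     tol=1e-8,
>                                     random_state=8,
>                                     solver='sag',
>                                     max_iter=1000).fit(df_train.values, y_train)
>    predicted = log_reg_fit.predict(df_test.values)
>    accuracy = accuracy_score(y_test, predicted)
>    kappa = cohen_kappa_score(y_test, predicted)
>
>    return [kappa, accuracy]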
>
>
>Thank you again for the help,
>
>Chris
>
>> On Aug 15, 2016, at 4:26 PM, mail at sebastianraschka.com wrote:
>>
>> Hm, it was worth a try. What happens if you change the solver to
>> something other than liblinear? Does the issue still persist?
>>
>>
>> Btw, scikit-learn works with NumPy arrays, not NumPy matrices.
>> This is probably unrelated to your issue, but I’d recommend setting
>>
>>> y_train = df_train.pass_fail.values
>>> y_test = df_test.pass_fail.values
>>
>> instead of
>>
>>> y_train = df_train.pass_fail.as_matrix()
>>> y_test = df_test.pass_fail.as_matrix()
>>
>>
>> Also, try passing NumPy arrays to the fit method:
>>
>>> log_reg_fit = LogisticRegression(...).fit(df_train.values, y_train)
>>
>> and
>>
>>> predicted = log_reg_fit.predict(df_test.values)
>>
>> and so forth.
>>
>>
>>
>>
>>
>>> On Aug 15, 2016, at 6:00 PM, Chris Cameron <chris at upnix.com> wrote:
>>>
>>> Sebastian,
>>>
>>> That doesn’t do it. With the function:
>>>
>>> def log_run(logreg_x, logreg_y):
>>>     logreg_x['pass_fail'] = logreg_y
>>>     df_train, df_test = train_test_split(logreg_x, random_state=0)
>>>     y_train = df_train.pass_fail.as_matrix()
>>>     y_test = df_test.pass_fail.as_matrix()
>>>     del(df_train['pass_fail'])
>>>     del(df_test['pass_fail'])
>>>     log_reg_fit = LogisticRegression(class_weight='balanced',
>>>                                      tol=0.000000001,
>>>                                      random_state=0).fit(df_train, y_train)
>>>     predicted = log_reg_fit.predict(df_test)
>>>     accuracy = accuracy_score(y_test, predicted)
>>>     kappa = cohen_kappa_score(y_test, predicted)
>>>
>>>     return [kappa, accuracy]
>>>
>>>
>>> I’m still seeing:
>>> log_run(df_save, y)
>>> Out[7]: [-0.054421768707483005, 0.48333333333333334]
>>>
>>> log_run(df_save, y)
>>> Out[8]: [0.042553191489361743, 0.55000000000000004]
>>>
>>> log_run(df_save, y)
>>> Out[9]: [0.042553191489361743, 0.55000000000000004]
>>>
>>> log_run(df_save, y)
>>> Out[10]: [0.027777777777777728, 0.53333333333333333]
>>>
>>>
>>> Chris
>>>
>>>> On Aug 15, 2016, at 3:42 PM, mail at sebastianraschka.com wrote:
>>>>
>>>> Hi, Chris,
>>>> have you set the random seed to a specific, constant integer value?
>>>> Note that the default in LogisticRegression is random_state=None.
>>>> Setting it to some arbitrary number like 123 may help if you haven’t
>>>> done so yet.
>>>>
>>>> Best,
>>>> Sebastian
>>>>
>>>>
>>>>
>>>>> On Aug 15, 2016, at 5:27 PM, Chris Cameron <chris at upnix.com> wrote:
>>>>>
>>>>> Hi all,
>>>>>
>>>>> Using the same X and y values,
>>>>> sklearn.linear_model.LogisticRegression.fit() is providing me with
>>>>> inconsistent results.
>>>>>
>>>>> The documentation for sklearn.linear_model.LogisticRegression
>>>>> states that “It is thus not uncommon, to have slightly different
>>>>> results for the same input data.” I am experiencing this; however, the
>>>>> fix of using a smaller “tol” parameter isn’t providing me with a
>>>>> consistent fit.
>>>>>
>>>>> The code I’m using:
>>>>>
>>>>> def log_run(logreg_x, logreg_y):
>>>>>     logreg_x['pass_fail'] = logreg_y
>>>>>     df_train, df_test = train_test_split(logreg_x, random_state=0)
>>>>>     y_train = df_train.pass_fail.as_matrix()
>>>>>     y_test = df_test.pass_fail.as_matrix()
>>>>>     del(df_train['pass_fail'])
>>>>>     del(df_test['pass_fail'])
>>>>>     log_reg_fit = LogisticRegression(class_weight='balanced',
>>>>>                                      tol=0.000000001).fit(df_train, y_train)
>>>>>     predicted = log_reg_fit.predict(df_test)
>>>>>     accuracy = accuracy_score(y_test, predicted)
>>>>>     kappa = cohen_kappa_score(y_test, predicted)
>>>>>
>>>>>     return [kappa, accuracy]
>>>>>
>>>>>
>>>>> I’ve gone out of my way to be sure the test and train data are the
>>>>> same for each run, so I don’t think there should be random shuffling
>>>>> going on.
>>>>>
>>>>> Example output:
>>>>> ---
>>>>> log_run(df_save, y)
>>>>> Out[32]: [0.027777777777777728, 0.53333333333333333]
>>>>>
>>>>> log_run(df_save, y)
>>>>> Out[33]: [0.027777777777777728, 0.53333333333333333]
>>>>>
>>>>> log_run(df_save, y)
>>>>> Out[34]: [0.11347517730496456, 0.58333333333333337]
>>>>>
>>>>> log_run(df_save, y)
>>>>> Out[35]: [0.042553191489361743, 0.55000000000000004]
>>>>>
>>>>> log_run(df_save, y)
>>>>> Out[36]: [-0.07407407407407407, 0.51666666666666672]
>>>>>
>>>>> log_run(df_save, y)
>>>>> Out[37]: [0.042553191489361743, 0.55000000000000004]
>>>>>
>>>>> A little information on the problem DataFrame:
>>>>> ---
>>>>> len(df_save)
>>>>> Out[40]: 240
>>>>>
>>>>> len(df_save.columns)
>>>>> Out[41]: 18
>>>>>
>>>>>
>>>>> If I omit this particular column the Kappa no longer fluctuates:
>>>>>
>>>>> df_save['abc'].head()
>>>>> Out[42]:
>>>>> 0 0.026316
>>>>> 1 0.333333
>>>>> 2 0.015152
>>>>> 3 0.010526
>>>>> 4 0.125000
>>>>> Name: abc, dtype: float64
>>>>>
>>>>>
>>>>> Does anyone have ideas on how I can figure this out? Is there some
>>>>> randomness/shuffling still going on that I missed?
>>>>>
>>>>>
>>>>> Thanks!
>>>>> Chris
>>>>> _______________________________________________
>>>>> scikit-learn mailing list
>>>>> scikit-learn at python.org
>>>>> https://mail.python.org/mailman/listinfo/scikit-learn