[scikit-learn] Inconsistent Logistic Regression fit results

Gael Varoquaux gael.varoquaux at normalesup.org
Wed Aug 17 03:23:12 EDT 2016


In other words, you have an ill-conditioned estimation problem, and what you were seeing were numerical instabilities due to this ill-conditioning.

Not a bug; this is expected behavior.
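
If you want to check this for yourself, the condition number of the design matrix is a quick diagnostic. A minimal sketch (assuming df_save is the DataFrame discussed below in the thread):

import numpy as np

# A very large condition number means small numerical perturbations in
# the solver can move the fitted coefficients a lot.
X = df_save.values
print(np.linalg.cond(X))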

Sent from my phone. Please forgive brevity and misspelling.



On Aug 16, 2016, at 18:17, Chris Cameron <chris at upnix.com> wrote:
>Thank you everyone for your help. The short version of this email is
>that changing the solver from ‘liblinear’ to ‘sag’ fixed my problem -
>but only if I upped “max_iter” to 1000.
>
>
>Longer version - 
>Without max_iter=1000, I would get the warning:
>ConvergenceWarning: The max_iter was reached which means the coef_ did
>not converge
>
>I have some columns in my data with a huge range of values. Using
>“liblinear”, if I transformed those columns to shrink that range, the
>results were consistent every time.
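>
>A sketch of the kind of rescaling I mean (StandardScaler is just one
>way to do it, not necessarily the exact transform I used):
>
>from sklearn.preprocessing import StandardScaler
>
># Fit the scaler on the training rows only, then apply the same
># transform to the test rows so both are scaled consistently.
>scaler = StandardScaler().fit(df_train.values)
>X_train = scaler.transform(df_train.values)
>X_test = scaler.transform(df_test.values)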
>
>This is the function I ended up using -
>
>from sklearn.linear_model import LogisticRegression
>from sklearn.metrics import accuracy_score, cohen_kappa_score
>from sklearn.model_selection import train_test_split  # sklearn.cross_validation before 0.18
>
>def log_run(logreg_x, logreg_y):
>    logreg_x['pass_fail'] = logreg_y
>    df_train, df_test, y_train, y_test = train_test_split(
>        logreg_x, logreg_y, random_state=0)
>    del df_train['pass_fail']
>    del df_test['pass_fail']
>    log_reg_fit = LogisticRegression(class_weight='balanced',
>                                     tol=1e-8,
>                                     random_state=8,
>                                     solver='sag',
>                                     max_iter=1000).fit(df_train.values, y_train)
>    predicted = log_reg_fit.predict(df_test.values)
>    accuracy = accuracy_score(y_test, predicted)
>    kappa = cohen_kappa_score(y_test, predicted)
>
>    return [kappa, accuracy]
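>
>As a quick sanity check, repeated calls now return the same pair
>(assuming df_save and y are defined as before):
>
>for _ in range(3):
>    print(log_run(df_save, y))  # same [kappa, accuracy] every time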
>
>
>Thank you again for the help,
>
>Chris
>
>> On Aug 15, 2016, at 4:26 PM, mail at sebastianraschka.com wrote:
>> 
>> Hm, it was worth a try. What happens if you change the solver to
>> something other than liblinear? Does the issue still persist?
>> 
>> 
>> Btw, scikit-learn works with NumPy arrays, not NumPy matrices.
>> Probably unrelated to your issue, but I’d recommend setting
>> 
>>>   y_train = df_train.pass_fail.values
>>>   y_test = df_test.pass_fail.values
>> 
>> instead of
>> 
>>>   y_train = df_train.pass_fail.as_matrix()
>>>   y_test = df_test.pass_fail.as_matrix()
>> 
>> 
>> Also, try passing NumPy arrays to the fit method:
>> 
>>>   log_reg_fit = LogisticRegression(...).fit(df_train.values, y_train)
>> 
>> and
>> 
>>> predicted = log_reg_fit.predict(df_test.values)
>> 
>> and so forth.
>> 
>> 
>> 
>> 
>> 
>>> On Aug 15, 2016, at 6:00 PM, Chris Cameron <chris at upnix.com> wrote:
>>> 
>>> Sebastian,
>>> 
>>> That doesn’t do it. With the function:
>>> 
>>> def log_run(logreg_x, logreg_y):
>>>   logreg_x['pass_fail'] = logreg_y
>>>   df_train, df_test = train_test_split(logreg_x, random_state=0)
>>>   y_train = df_train.pass_fail.as_matrix()
>>>   y_test = df_test.pass_fail.as_matrix()
>>>   del(df_train['pass_fail'])
>>>   del(df_test['pass_fail'])
>>>   log_reg_fit = LogisticRegression(class_weight='balanced',
>>>                                    tol=1e-9,
>>>                                    random_state=0).fit(df_train, y_train)
>>>   predicted = log_reg_fit.predict(df_test)
>>>   accuracy = accuracy_score(y_test, predicted)
>>>   kappa = cohen_kappa_score(y_test, predicted)
>>> 
>>>   return [kappa, accuracy]
>>> 
>>> I’m still seeing:
>>> log_run(df_save, y)
>>> Out[7]: [-0.054421768707483005, 0.48333333333333334]
>>> 
>>> log_run(df_save, y)
>>> Out[8]: [0.042553191489361743, 0.55000000000000004]
>>> 
>>> log_run(df_save, y)
>>> Out[9]: [0.042553191489361743, 0.55000000000000004]
>>> 
>>> log_run(df_save, y)
>>> Out[10]: [0.027777777777777728, 0.53333333333333333]
>>> 
>>> 
>>> Chris
>>> 
>>>> On Aug 15, 2016, at 3:42 PM, mail at sebastianraschka.com wrote:
>>>> 
>>>> Hi, Chris,
>>>> have you set the random seed to a specific, constant integer value?
>>>> Note that the default in LogisticRegression is random_state=None.
>>>> Setting it to an arbitrary number like 123 may help if you haven’t
>>>> done so yet.
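>>>> 
>>>> For example (123 is an arbitrary but fixed choice):
>>>> 
>>>> from sklearn.linear_model import LogisticRegression
>>>> 
>>>> # Any fixed integer works; the point is that it is not None.
>>>> clf = LogisticRegression(random_state=123)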
>>>> 
>>>> Best,
>>>> Sebastian
>>>> 
>>>> 
>>>> 
>>>>> On Aug 15, 2016, at 5:27 PM, Chris Cameron <chris at upnix.com> wrote:
>>>>> 
>>>>> Hi all,
>>>>> 
>>>>> Using the same X and y values,
>>>>> sklearn.linear_model.LogisticRegression.fit() is giving me
>>>>> inconsistent results.
>>>>> 
>>>>> The documentation for sklearn.linear_model.LogisticRegression
>>>>> states that “It is thus not uncommon, to have slightly different
>>>>> results for the same input data.” I am experiencing this; however,
>>>>> the fix of using a smaller “tol” parameter isn’t giving me a
>>>>> consistent fit.
>>>>> 
>>>>> The code I’m using:
>>>>> 
>>>>> def log_run(logreg_x, logreg_y):
>>>>>     logreg_x['pass_fail'] = logreg_y
>>>>>     df_train, df_test = train_test_split(logreg_x, random_state=0)
>>>>>     y_train = df_train.pass_fail.as_matrix()
>>>>>     y_test = df_test.pass_fail.as_matrix()
>>>>>     del df_train['pass_fail']
>>>>>     del df_test['pass_fail']
>>>>>     log_reg_fit = LogisticRegression(class_weight='balanced',
>>>>>                                      tol=1e-9).fit(df_train, y_train)
>>>>>     predicted = log_reg_fit.predict(df_test)
>>>>>     accuracy = accuracy_score(y_test, predicted)
>>>>>     kappa = cohen_kappa_score(y_test, predicted)
>>>>> 
>>>>>     return [kappa, accuracy]
>>>>> 
>>>>> 
>>>>> I’ve gone out of my way to be sure the test and train data are the
>>>>> same for each run, so I don’t think there should be random shuffling
>>>>> going on.
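>>>>> 
>>>>> For example, comparing the row indices from two splits shows the
>>>>> split itself is deterministic:
>>>>> 
>>>>> a, _ = train_test_split(logreg_x, random_state=0)
>>>>> b, _ = train_test_split(logreg_x, random_state=0)
>>>>> print(a.index.equals(b.index))  # True: same rows every time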
>>>>> 
>>>>> Example output:
>>>>> ---
>>>>> log_run(df_save, y)
>>>>> Out[32]: [0.027777777777777728, 0.53333333333333333]
>>>>> 
>>>>> log_run(df_save, y)
>>>>> Out[33]: [0.027777777777777728, 0.53333333333333333]
>>>>> 
>>>>> log_run(df_save, y)
>>>>> Out[34]: [0.11347517730496456, 0.58333333333333337]
>>>>> 
>>>>> log_run(df_save, y)
>>>>> Out[35]: [0.042553191489361743, 0.55000000000000004]
>>>>> 
>>>>> log_run(df_save, y)
>>>>> Out[36]: [-0.07407407407407407, 0.51666666666666672]
>>>>> 
>>>>> log_run(df_save, y)
>>>>> Out[37]: [0.042553191489361743, 0.55000000000000004]
>>>>> 
>>>>> A little information on the problem DataFrame:
>>>>> ---
>>>>> len(df_save)
>>>>> Out[40]: 240
>>>>> 
>>>>> len(df_save.columns)
>>>>> Out[41]: 18
>>>>> 
>>>>> 
>>>>> If I omit this particular column, the Kappa no longer fluctuates:
>>>>> 
>>>>> df_save['abc'].head()
>>>>> Out[42]: 
>>>>> 0    0.026316
>>>>> 1    0.333333
>>>>> 2    0.015152
>>>>> 3    0.010526
>>>>> 4    0.125000
>>>>> Name: abc, dtype: float64
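>>>>> 
>>>>> (By “omit” I mean simply dropping the column before the call, e.g.:)
>>>>> 
>>>>> # Without 'abc' the Kappa no longer fluctuates between runs.
>>>>> log_run(df_save.drop('abc', axis=1), y)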
>>>>> 
>>>>> 
>>>>> Does anyone have ideas on how I can figure this out? Is there some
>>>>> randomness/shuffling still going on that I missed?
>>>>> 
>>>>> 
>>>>> Thanks!
>>>>> Chris
>
>_______________________________________________
>scikit-learn mailing list
>scikit-learn at python.org
>https://mail.python.org/mailman/listinfo/scikit-learn

