[scikit-learn] Inconsistent Logistic Regression fit results

Andreas Mueller t3kcit at gmail.com
Mon Aug 15 18:17:25 EDT 2016


Hm that looks kinda convoluted.
Why don't you just do

     df_train, df_test, y_train, y_test = train_test_split(logreg_x, logreg_y, random_state=0)


?
What version of scikit-learn are you using?

Also, you are modifying the inputs. Can you try to do the same but
pass a copy of the input dataframe to the method each time?
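
A minimal sketch of both suggestions combined, not a tested fix for your data. It assumes scikit-learn 0.18 or later, where train_test_split lives in sklearn.model_selection (older releases have it in sklearn.cross_validation). The random frame below is only a hypothetical stand-in for df_save and y, with the same shape as in your report:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, cohen_kappa_score
from sklearn.model_selection import train_test_split


def log_run(logreg_x, logreg_y):
    # Split features and labels in one call, so no label column has to be
    # added to (and later deleted from) the caller's DataFrame.
    df_train, df_test, y_train, y_test = train_test_split(
        logreg_x, logreg_y, random_state=0)
    log_reg_fit = LogisticRegression(class_weight='balanced',
                                     random_state=0).fit(df_train, y_train)
    predicted = log_reg_fit.predict(df_test)
    accuracy = accuracy_score(y_test, predicted)
    kappa = cohen_kappa_score(y_test, predicted)
    return [kappa, accuracy]


# Hypothetical stand-in data: 240 rows and 18 columns, like df_save.
rng = np.random.RandomState(0)
df_save = pd.DataFrame(rng.rand(240, 18),
                       columns=['f%d' % i for i in range(18)])
y = (df_save['f0'] + 0.1 * rng.randn(240) > 0.5).astype(int).values

columns_before = list(df_save.columns)

# Pass a fresh copy each time so every call sees identical input.
first = log_run(df_save.copy(), y)
second = log_run(df_save.copy(), y)
```

If the two results still differ when each call gets its own copy, the input mutation is ruled out and the solver itself is the next thing to look at.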


On 08/15/2016 06:00 PM, Chris Cameron wrote:
> Sebastian,
>
> That doesn’t do it. With the function:
>
> def log_run(logreg_x, logreg_y):
>      logreg_x['pass_fail'] = logreg_y
>      df_train, df_test = train_test_split(logreg_x, random_state=0)
>      y_train = df_train.pass_fail.as_matrix()
>      y_test = df_test.pass_fail.as_matrix()
>      del(df_train['pass_fail'])
>      del(df_test['pass_fail'])
>      log_reg_fit = LogisticRegression(class_weight='balanced',
>                                       tol=0.000000001,
>                                       random_state=0).fit(df_train, y_train)
>      predicted = log_reg_fit.predict(df_test)
>      accuracy = accuracy_score(y_test, predicted)
>      kappa = cohen_kappa_score(y_test, predicted)
>      
>      return [kappa, accuracy]
>
> I’m still seeing:
> log_run(df_save, y)
> Out[7]: [-0.054421768707483005, 0.48333333333333334]
>
> log_run(df_save, y)
> Out[8]: [0.042553191489361743, 0.55000000000000004]
>
> log_run(df_save, y)
> Out[9]: [0.042553191489361743, 0.55000000000000004]
>
> log_run(df_save, y)
> Out[10]: [0.027777777777777728, 0.53333333333333333]
>
>
> Chris
>
>> On Aug 15, 2016, at 3:42 PM, mail at sebastianraschka.com wrote:
>>
>> Hi, Chris,
>> have you set the random seed to a specific, constant integer value? Note that the default in LogisticRegression is random_state=None. Setting it to some arbitrary number like 123 may help if you haven’t done so yet.
>>
>> Best,
>> Sebastian
>>
>>
>>
>>> On Aug 15, 2016, at 5:27 PM, Chris Cameron <chris at upnix.com> wrote:
>>>
>>> Hi all,
>>>
>>> Using the same X and y values sklearn.linear_model.LogisticRegression.fit() is providing me with inconsistent results.
>>>
>>> The documentation for sklearn.linear_model.LogisticRegression states that “It is thus not uncommon, to have slightly different results for the same input data.” I am experiencing this; however, the suggested fix of using a smaller “tol” parameter isn’t giving me a consistent fit.
>>>
>>> The code I’m using:
>>>
>>> def log_run(logreg_x, logreg_y):
>>>    logreg_x['pass_fail'] = logreg_y
>>>    df_train, df_test = train_test_split(logreg_x, random_state=0)
>>>    y_train = df_train.pass_fail.as_matrix()
>>>    y_test = df_test.pass_fail.as_matrix()
>>>    del(df_train['pass_fail'])
>>>    del(df_test['pass_fail'])
>>>    log_reg_fit = LogisticRegression(class_weight='balanced',tol=0.000000001).fit(df_train, y_train)
>>>    predicted = log_reg_fit.predict(df_test)
>>>    accuracy = accuracy_score(y_test, predicted)
>>>    kappa = cohen_kappa_score(y_test, predicted)
>>>
>>>    return [kappa, accuracy]
>>>
>>>
>>> I’ve gone out of my way to be sure the test and train data is the same for each run, so I don’t think there should be random shuffling going on.
>>>
>>> Example output:
>>> ---
>>> log_run(df_save, y)
>>> Out[32]: [0.027777777777777728, 0.53333333333333333]
>>>
>>> log_run(df_save, y)
>>> Out[33]: [0.027777777777777728, 0.53333333333333333]
>>>
>>> log_run(df_save, y)
>>> Out[34]: [0.11347517730496456, 0.58333333333333337]
>>>
>>> log_run(df_save, y)
>>> Out[35]: [0.042553191489361743, 0.55000000000000004]
>>>
>>> log_run(df_save, y)
>>> Out[36]: [-0.07407407407407407, 0.51666666666666672]
>>>
>>> log_run(df_save, y)
>>> Out[37]: [0.042553191489361743, 0.55000000000000004]
>>>
>>> A little information on the problem DataFrame:
>>> ---
>>> len(df_save)
>>> Out[40]: 240
>>>
>>> len(df_save.columns)
>>> Out[41]: 18
>>>
>>>
>>> If I omit this particular column the Kappa no longer fluctuates:
>>>
>>> df_save['abc'].head()
>>> Out[42]:
>>> 0    0.026316
>>> 1    0.333333
>>> 2    0.015152
>>> 3    0.010526
>>> 4    0.125000
>>> Name: abc, dtype: float64
>>>
>>>
>>> Does anyone have ideas on how I can figure this out? Is there some randomness/shuffling still going on I missed?
>>>
>>>
>>> Thanks!
>>> Chris
>>> _______________________________________________
>>> scikit-learn mailing list
>>> scikit-learn at python.org
>>> https://mail.python.org/mailman/listinfo/scikit-learn