Inconsistent Logistic Regression fit results
Hi all,

Using the same X and y values sklearn.linear_model.LogisticRegression.fit() is providing me with inconsistent results.

The documentation for sklearn.linear_model.LogisticRegression states that “It is thus not uncommon, to have slightly different results for the same input data.” I am experiencing this, however the fix of using a smaller “tol” parameter isn’t providing me with consistent fit.

The code I’m using:

    def log_run(logreg_x, logreg_y):
        logreg_x['pass_fail'] = logreg_y
        df_train, df_test = train_test_split(logreg_x, random_state=0)
        y_train = df_train.pass_fail.as_matrix()
        y_test = df_test.pass_fail.as_matrix()
        del(df_train['pass_fail'])
        del(df_test['pass_fail'])
        log_reg_fit = LogisticRegression(class_weight='balanced',
                                         tol=0.000000001).fit(df_train, y_train)
        predicted = log_reg_fit.predict(df_test)
        accuracy = accuracy_score(y_test, predicted)
        kappa = cohen_kappa_score(y_test, predicted)
        return [kappa, accuracy]

I’ve gone out of my way to be sure the test and train data is the same for each run, so I don’t think there should be random shuffling going on.

Example output:

    log_run(df_save, y)
    Out[32]: [0.027777777777777728, 0.53333333333333333]
    log_run(df_save, y)
    Out[33]: [0.027777777777777728, 0.53333333333333333]
    log_run(df_save, y)
    Out[34]: [0.11347517730496456, 0.58333333333333337]
    log_run(df_save, y)
    Out[35]: [0.042553191489361743, 0.55000000000000004]
    log_run(df_save, y)
    Out[36]: [-0.07407407407407407, 0.51666666666666672]
    log_run(df_save, y)
    Out[37]: [0.042553191489361743, 0.55000000000000004]

A little information on the problem DataFrame:

    len(df_save)
    Out[40]: 240
    len(df_save.columns)
    Out[41]: 18

If I omit this particular column the Kappa no longer fluctuates:

    df_save['abc'].head()
    Out[42]:
    0    0.026316
    1    0.333333
    2    0.015152
    3    0.010526
    4    0.125000
    Name: abc, dtype: float64

Does anyone have ideas on how I can figure this out? Is there some randomness/shuffling still going on I missed?

Thanks!
Chris
Hi, Chris,

have you set the random seed to a specific, constant integer value? Note that the default in LogisticRegression is random_state=None. Setting it to some arbitrary number like 123 may help if you haven’t done so yet.

Best,
Sebastian
On Aug 15, 2016, at 5:27 PM, Chris Cameron <chris@upnix.com> wrote:
[...]
_______________________________________________
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn
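One way to narrow down a question like this is to measure directly whether repeated fits on identical arrays produce different coefficients, independent of the train/test split and the scoring. The following is a minimal sketch, not the code from the thread: the helper name `coef_spread` is hypothetical, the data is a synthetic stand-in with the same shape (240 rows, 18 columns) generated by `make_classification`, and the default solver of a current scikit-learn release is assumed.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression


def coef_spread(make_model, X, y, n_runs=5):
    """Largest absolute difference between coefficients across repeated fits."""
    coefs = np.array([make_model().fit(X, y).coef_.ravel() for _ in range(n_runs)])
    # Per-coefficient range over the runs, then the worst case.
    return float(np.max(coefs.max(axis=0) - coefs.min(axis=0)))


# Synthetic stand-in for the 240 x 18 DataFrame (NOT the data from the thread).
X, y = make_classification(n_samples=240, n_features=18, random_state=0)

spread = coef_spread(
    lambda: LogisticRegression(class_weight="balanced", random_state=0), X, y
)
print("max coefficient spread over 5 fits:", spread)
```

A spread of (nearly) zero means the estimator itself is reproducible on that input, so any remaining variation would have to come from the data preparation around it. A clearly non-zero spread points at the solver.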
Sebastian,

That doesn’t do it. With the function:

    def log_run(logreg_x, logreg_y):
        logreg_x['pass_fail'] = logreg_y
        df_train, df_test = train_test_split(logreg_x, random_state=0)
        y_train = df_train.pass_fail.as_matrix()
        y_test = df_test.pass_fail.as_matrix()
        del(df_train['pass_fail'])
        del(df_test['pass_fail'])
        log_reg_fit = LogisticRegression(class_weight='balanced',
                                         tol=0.000000001,
                                         random_state=0).fit(df_train, y_train)
        predicted = log_reg_fit.predict(df_test)
        accuracy = accuracy_score(y_test, predicted)
        kappa = cohen_kappa_score(y_test, predicted)
        return [kappa, accuracy]

I’m still seeing:

    log_run(df_save, y)
    Out[7]: [-0.054421768707483005, 0.48333333333333334]
    log_run(df_save, y)
    Out[8]: [0.042553191489361743, 0.55000000000000004]
    log_run(df_save, y)
    Out[9]: [0.042553191489361743, 0.55000000000000004]
    log_run(df_save, y)
    Out[10]: [0.027777777777777728, 0.53333333333333333]

Chris
On Aug 15, 2016, at 3:42 PM, mail@sebastianraschka.com wrote:
[...]
Hm, that looks kinda convoluted. Why don't you just do this?

    df_train, df_test, y_train, y_test = train_test_split(logreg_x, logreg_y, random_state=0)

What version of scikit-learn are you using? Also, you are modifying the inputs. Can you try to do the same but pass a copy of the input dataframe to the method each time?

On 08/15/2016 06:00 PM, Chris Cameron wrote:
[...]
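The two suggestions in the reply above (split features and labels in one call, and avoid modifying the caller's DataFrame) can be sketched as follows. This is an illustration under assumptions: `split_copy` is a hypothetical helper name, the DataFrame is random data with the same shape as in the thread, and `train_test_split` is imported from `sklearn.model_selection` as in current scikit-learn releases.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split


def split_copy(features, labels, seed=0):
    """Split features and labels together without touching the caller's DataFrame."""
    # Working on a copy means no label column is ever added to or deleted
    # from the original object between runs.
    return train_test_split(features.copy(), labels, random_state=seed)


# Random stand-in data with the thread's shape (NOT the original data).
rng = np.random.RandomState(0)
df = pd.DataFrame(rng.rand(240, 18), columns=[f"f{i}" for i in range(18)])
labels = rng.randint(0, 2, size=240)

df_train, df_test, y_train, y_test = split_copy(df, labels)
print(df_train.shape, df_test.shape)
```

With 240 rows and the default test size of 25%, this gives 180 training rows and 60 test rows, which is consistent with the accuracies in the thread being multiples of 1/60.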
Hm, it was worth a try. What happens if you change the solver to something other than liblinear? Does the issue still persist?

Btw., scikit-learn works with NumPy arrays, not NumPy matrices. It is probably unrelated to your issue, but I’d recommend setting

    y_train = df_train.pass_fail.values
    y_test = df_test.pass_fail.values

instead of

    y_train = df_train.pass_fail.as_matrix()
    y_test = df_test.pass_fail.as_matrix()

Also, try passing NumPy arrays to the fit method:

    log_reg_fit = LogisticRegression(...).fit(df_train.values, y_train)

and

    predicted = log_reg_fit.predict(df_test.values)

and so forth.
On Aug 15, 2016, at 6:00 PM, Chris Cameron <chris@upnix.com> wrote:
[...]
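A small sketch of the advice above: convert to plain NumPy arrays and try more than one solver. Note that `as_matrix()` was deprecated and has since been removed from pandas (it is gone as of pandas 1.0), so the sketch uses `to_numpy()`. The helper name `fit_on_arrays`, the toy data, and the choice of solvers are assumptions made for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression


def fit_on_arrays(df_train, y_train, solver="lbfgs"):
    """Fit on plain NumPy arrays using the requested solver."""
    X = df_train.to_numpy()   # modern replacement for DataFrame.as_matrix()
    y = np.asarray(y_train)
    model = LogisticRegression(class_weight="balanced", solver=solver,
                               random_state=0, max_iter=1000)
    return model.fit(X, y)


# Toy data (NOT the data from the thread): the label depends mostly on column "a".
rng = np.random.RandomState(0)
df_train = pd.DataFrame(rng.rand(180, 3), columns=["a", "b", "c"])
y_train = (df_train["a"] + 0.1 * rng.randn(180) > 0.5).astype(int)

for solver in ("liblinear", "lbfgs", "sag"):
    model = fit_on_arrays(df_train, y_train, solver=solver)
    print(solver, model.coef_.round(3))
```

If the coefficients agree across solvers, the optimisation problem is probably well behaved. Large disagreements suggest that at least one solver has not converged.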
Thank you everyone for your help. The short version of this email is that changing the solver from ‘liblinear’ to ‘sag’ fixed my problem - but only if I upped “max_iter” to 1000.

Longer version - Without max_iter=1000, I would get the warning:

    ConvergenceWarning: The max_iter was reached which means the coef_ did not converge

I have some columns in my data that have a huge range of values. Using “liblinear”, if I transformed those columns, causing the range to be smaller, the results would be consistent every time.

This is the function I ended up using -

    def log_run(logreg_x, logreg_y):
        logreg_x['pass_fail'] = logreg_y
        df_train, df_test, y_train, y_test = train_test_split(logreg_x, logreg_y, random_state=0)
        del(df_train['pass_fail'])
        del(df_test['pass_fail'])
        log_reg_fit = LogisticRegression(class_weight='balanced',
                                         tol=0.00000001,
                                         random_state=8,
                                         solver='sag',
                                         max_iter=1000).fit(df_train.values, y_train)
        predicted = log_reg_fit.predict(df_test.values)
        accuracy = accuracy_score(y_test, predicted)
        kappa = cohen_kappa_score(y_test, predicted)
        return [kappa, accuracy]

Thank you again for the help,
Chris
On Aug 15, 2016, at 4:26 PM, mail@sebastianraschka.com wrote:
[...]
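The observation above, that shrinking the range of the wide columns made the results consistent, suggests rescaling the features before fitting. The message does not say which transformation was used, so the `StandardScaler` below is an assumed choice, and the data (with one artificially inflated column) is synthetic. Treat this as a sketch rather than the fix that was actually applied.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (NOT from the thread) with one hypothetical wide-range column.
rng = np.random.RandomState(0)
X = rng.rand(240, 4)
X[:, 0] *= 1e6
y = (X[:, 1] + 0.1 * rng.randn(240) > 0.5).astype(int)

# Scaling inside a pipeline keeps the transform tied to the training data only.
model = make_pipeline(
    StandardScaler(),
    LogisticRegression(class_weight="balanced", solver="sag",
                       max_iter=1000, random_state=8),
)
model.fit(X, y)
print("training accuracy:", model.score(X, y))
```

Stochastic solvers such as 'sag' tend to need many more iterations when features are on very different scales, which would fit with the ConvergenceWarning reported above.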
In other words, you have an ill-conditioned estimation problem, and what you were seeing were numerical instabilities due to this ill-conditioning. Not a bug. An expected behavior.

On Aug 16, 2016, at 18:17, Chris Cameron <chris@upnix.com> wrote:
[...]
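A rough way to see what "ill-conditioned" means here is to compare the condition number of the feature matrix before and after rescaling. The sketch below uses synthetic data with one hypothetical wide-range column. It is an illustration of the concept, not an analysis of the data from the thread, and the condition number of the design matrix is only a proxy for the conditioning of the logistic regression problem itself.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic data (NOT from the thread) with one hypothetical wide-range column.
rng = np.random.RandomState(0)
X = rng.rand(240, 4)
X[:, 0] *= 1e6

# Ratio of largest to smallest singular value; large values mean that small
# numerical perturbations can move the fitted coefficients a lot.
cond_raw = np.linalg.cond(X)
cond_scaled = np.linalg.cond(StandardScaler().fit_transform(X))
print(f"condition number raw: {cond_raw:.3g}, scaled: {cond_scaled:.3g}")
```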
participants (4)
- Andreas Mueller
- Chris Cameron
- Gael Varoquaux
- mail@sebastianraschka.com