Difference in normalization between Lasso and LogisticRegression + L1
Hi everyone,

I noticed recently that in the Lasso implementation (and docs), the MSE term is normalized by the number of samples:
https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lasso...

For LogisticRegression + L1, however, the log loss does not seem to be normalized by the number of samples. One consequence is that the strength of the regularization depends explicitly on the number of samples: if you tile a dataset N times, Lasso learns the same coef, but LogisticRegression learns a different one.

Is this the intended behavior of LogisticRegression? I was surprised by it. Either way, it would be helpful to document it more clearly in the LogisticRegression docs (I can make a PR):
https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Logis...

Jesse
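P.S. A minimal sketch of what I mean, on synthetic data (illustrative only; exact numbers depend on solver tolerances):

    import numpy as np
    from sklearn.linear_model import Lasso, LogisticRegression

    rng = np.random.RandomState(0)
    X = rng.randn(50, 3)
    y_reg = X @ np.array([1.0, -2.0, 0.0]) + 0.1 * rng.randn(50)
    y_clf = (y_reg > 0).astype(int)

    # Tile the dataset 10 times: every sample appears 10 times.
    Xt = np.tile(X, (10, 1))

    # Lasso divides the MSE term by n_samples, so the coef is unchanged.
    print(Lasso(alpha=0.1).fit(X, y_reg).coef_)
    print(Lasso(alpha=0.1).fit(Xt, np.tile(y_reg, 10)).coef_)

    # LogisticRegression sums the log loss without dividing by n_samples,
    # so tiling effectively weakens the penalty and the coef changes.
    lr = LogisticRegression(penalty='l1', C=1.0, solver='saga', max_iter=5000)
    print(lr.fit(X, y_clf).coef_)
    print(lr.fit(Xt, np.tile(y_clf, 10)).coef_)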
Hi Jesse,

I think there was an effort back in 2012/13 to compare ways of normalizing the data attachment term between Lasso and Ridge regression, but it may not have been finished or extended to LogisticRegression.

If it is not documented well, it could definitely benefit from a documentation update.

As for changing it to a more consistent state: that would require adding a keyword argument for this behavior and, after discussion, possibly changing the default value over some deprecation cycles (though this seems like a dangerous one to change at all, imho).

Michael
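P.S. For reference, the two objectives as currently documented (LaTeX notation):

    \min_w \; \frac{1}{2\, n_{\text{samples}}} \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_1
    \quad \text{(Lasso)}

    \min_{w,\, c} \; \lVert w \rVert_1 + C \sum_{i=1}^{n} \log\!\left(1 + \exp\bigl(-y_i (x_i^\top w + c)\bigr)\right)
    \quad \text{(LogisticRegression, L1)}

The first averages the loss over samples; the second sums it, so for a fixed C the penalty carries less and less relative weight as n grows.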
That is indeed not ideal. I think we just went with what liblinear did, and kept that behavior when saga was introduced. It should probably be scaled as in Lasso, I would imagine?
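Concretely, because the current objective sums the log loss, dividing C by the duplication factor recovers Jesse's un-tiled coef. A sketch reusing the variables from his P.S. (agreement is up to solver tolerance):

    # Same coef as lr.fit(X, y_clf) above, because
    #   ||w||_1 + (C/10) * sum over 10n samples
    # = ||w||_1 +  C     * sum over   n samples.
    lr_scaled = LogisticRegression(penalty='l1', C=1.0 / 10, solver='saga',
                                   max_iter=5000)
    print(lr_scaled.fit(Xt, np.tile(y_clf, 10)).coef_)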
I looked into this a while ago. There were differences in which algorithms regularize the intercept and which do not (I believe liblinear does, lbfgs does not). All of the algorithms disagreed with logistic regression in scipy.

- Stuart
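P.S. A quick way to see the intercept difference, reusing X and y_clf from Jesse's sketch (illustrative; the gap is most visible when the true intercept is large). liblinear folds the intercept into the penalized weights, scaled by the intercept_scaling parameter, while lbfgs leaves it unpenalized:

    # liblinear appends a constant column equal to intercept_scaling and
    # penalizes the corresponding coefficient; raising intercept_scaling
    # reduces (but does not remove) the shrinkage of the intercept.
    for solver, scaling in [('liblinear', 1.0), ('liblinear', 100.0), ('lbfgs', 1.0)]:
        clf = LogisticRegression(C=1.0, solver=solver, intercept_scaling=scaling)
        print(solver, scaling, clf.fit(X, y_clf).intercept_)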
See https://github.com/scikit-learn/scikit-learn/pulls?utf8=%E2%9C%93&q=is%3Apr+is%3Aclosed+scale_C+ for historical perspective on this issue (the closed scale_C PRs).

Alex
participants (5)
- Alexandre Gramfort
- Andreas Mueller
- Jesse Livezey
- Michael Eickenberg
- Stuart Reynolds