Question about Python's L2-Regularized Logistic Regression
Hi All,

I am trying to understand Python’s code [function ‘_fit_liblinear’ in https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/svm/base.py] for fitting an L2-regularized logistic regression with the ‘liblinear’ solver. More specifically, my [approximately balanced class] dataset has far more predictors [p=2000] than observations [n=100]. I am confused about why, when I increase C [and thus decrease the regularization strength] in fitting the logistic regression model to my training data, I still obtain high AUC results when the model is applied to my testing data. Is Python internally doing feature selection when fitting this model for high C values? Or why do the almost unregularized model [high C values] and the regularized model [C selected by cross-validation] give similar AUC and accuracy results on the testing data? Should I be coding my predictors as +1/-1?

Any pointers/explanations would be much appreciated!

Thanks,
Kristen
Hi, Kristen,

there shouldn’t be any internal feature selection going on behind the scenes. You may want to compare the weight coefficients of your regularized vs. unregularized model; if they are exactly the same, that would be an indicator that something funny is going on. Otherwise, it could be that the strongly regularized and the non-regularized models are equally good or bad on that dataset (by the way, what value do you get for the ROC AUC?). You can access the weight coefficients via the “coef_” attribute after fitting, i.e.,

    lr = LogisticRegression(...)
    lr.fit(X_train, y_train)
    lr.coef_
> Should I be coding my predictors as +1/-1?

0 and 1 should be just fine; that is the expected default.

Best,
Sebastian
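The coefficient comparison suggested above could be sketched roughly as follows. This is only an illustration under stated assumptions: the data are synthetic (generated with make_classification to mimic n=100, p=2000), and the two C values (1e-3 and 1e3) are arbitrary stand-ins for "strongly regularized" and "almost unregularized", not values from the thread.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# ASSUMPTION: synthetic stand-in for the p >> n data described in the question.
X, y = make_classification(n_samples=100, n_features=2000,
                           n_informative=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

# ASSUMPTION: illustrative C values; the L2 penalty is the default.
lr_strong = LogisticRegression(C=1e-3, solver="liblinear").fit(X_train, y_train)
lr_weak = LogisticRegression(C=1e3, solver="liblinear").fit(X_train, y_train)

coef_strong = lr_strong.coef_.ravel()
coef_weak = lr_weak.coef_.ravel()

# Identical coefficient vectors would suggest that C is not taking effect.
identical = np.allclose(coef_strong, coef_weak)

# An L2 penalty shrinks weights but does not set them exactly to zero,
# so the count of non-zero weights shows whether features were dropped.
n_nonzero_weak = np.count_nonzero(coef_weak)
norm_strong = np.linalg.norm(coef_strong)
norm_weak = np.linalg.norm(coef_weak)

print("identical coefficients:", identical)
print("non-zero weights at high C:", n_nonzero_weak, "of", coef_weak.size)
print("L2 norm at low C:", norm_strong, " at high C:", norm_weak)
```

If the two fits differ mainly in the norm of the weight vector while nearly all weights stay non-zero, that is consistent with no feature selection taking place.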
On Sep 29, 2016, at 6:09 PM, Kristen M. Altenburger <kaltenb@stanford.edu> wrote:
> [...]
_______________________________________________
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn
That should totally depend on your dataset. Maybe it is an "easy" dataset and not much regularization is needed. Maybe use PCA(n_components=2) or an LDA transform to take a look at your data in 2D. Maybe the classes are easily linearly separable?

Sklearn does not do any feature selection if you don't ask it to.

What C values are you using? Try an np.logspace grid, but go much farther out on both sides than you think reasonable. Then plot AUC as a function of C to get a global idea of what is going on.

hth,
Michael

On Friday, September 30, 2016, Kristen M. Altenburger <kaltenb@stanford.edu> wrote:
> [...]
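The suggestion to sweep C over a wide np.logspace grid and look at AUC as a function of C could be sketched as below. This is a minimal sketch, not anyone's actual analysis: the data are synthetic (an assumption standing in for the n=100, p=2000 dataset), and the grid bounds (1e-8 to 1e8) and the 5-fold stratified cross-validation are illustrative choices.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# ASSUMPTION: synthetic stand-in for the p >> n data described in the question.
X, y = make_classification(n_samples=100, n_features=2000,
                           n_informative=20, random_state=0)

# ASSUMPTION: deliberately wide grid, 17 values from 1e-8 to 1e8.
C_grid = np.logspace(-8, 8, 17)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

mean_auc = []
for C in C_grid:
    clf = LogisticRegression(C=C, solver="liblinear")
    # Cross-validated ROC AUC for this regularization strength.
    scores = cross_val_score(clf, X, y, cv=cv, scoring="roc_auc")
    mean_auc.append(scores.mean())
mean_auc = np.array(mean_auc)

for C, auc in zip(C_grid, mean_auc):
    print(f"C={C:8.0e}  mean CV AUC={auc:.3f}")
```

Plotting mean_auc against C_grid on a logarithmic x-axis (for example with matplotlib's semilogx) gives the global picture. Because AUC depends only on how the predicted scores rank the test samples, a long flat stretch of the curve at high C would be consistent with the "easy dataset" explanation rather than with hidden feature selection.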
participants (3)
- Kristen M. Altenburger
- Michael Eickenberg
- Sebastian Raschka