RFE with logistic regression
Dear scikit-learn users,

I am using the recursive feature elimination (RFE) tool from sklearn to rank my features:

from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import RFE

classifier_RFE = LogisticRegression(C=1e9, verbose=1, max_iter=10000)
rfe = RFE(estimator=classifier_RFE, n_features_to_select=1, step=1)
rfe.fit(X, y)
ranking = rfe.ranking_
print(ranking)

1. The first problem I have is that when I execute the above code multiple times, I don't get the same results.
2. When I change the solver to 'sag' or 'saga' (classifier_RFE = LogisticRegression(C=1e9, verbose=1, max_iter=10000, solver='sag')), it seems that I get the same results at each run, but the ranking is not the same between these two solvers.
3. With C=1, it seems that I get the same results at each run for solver='liblinear', but not for the solvers 'sag' and 'saga'. I still don't get the same results between the different solvers.

Thanks for your help,
Best regards,
Ben
liblinear regularizes the intercept (which is a questionable thing to do and a poor choice of default in sklearn). The other solvers do not. On Tue, Jul 24, 2018 at 4:07 AM, Benoît Presles <benoit.presles@u-bourgogne.fr> wrote:
Agreed. But the setting here is C=1e9 (where C is the inverse regularization strength), so the regularization effect should be very small. It probably shouldn't matter much for convex optimization, but I would still try to:

a) set the random_state to some fixed value;
b) make sure that .n_iter_ < .max_iter, to see if that results in more consistency.

Best,
Sebastian
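A minimal sketch of these two suggestions, assuming synthetic data via make_classification in place of Ben's X and y (which are not shown in the thread); C=1 is used here so the sketch converges quickly, whereas the original code used C=1e9.

```python
# (a) fix random_state; (b) compare n_iter_ to max_iter after fitting.
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

clf = LogisticRegression(C=1.0, solver="sag", max_iter=10000,
                         random_state=0)  # (a) fixed seed for sag's sampling
rfe = RFE(estimator=clf, n_features_to_select=1, step=1)
rfe.fit(X, y)

# (b) the final estimator should have converged well before max_iter
print("n_iter_ =", rfe.estimator_.n_iter_, "(max_iter=10000)")
print("ranking =", rfe.ranking_)
```

With the seed fixed, repeated runs on the same data give the same ranking for the stochastic solvers.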
On Jul 24, 2018, at 11:16 AM, Stuart Reynolds <stuart@stuartreynolds.net> wrote:
liblinear regularizes the intercept (which is a questionable thing to do and a poor choice of default in sklearn). The other solvers do not.
I did the same tests as before adding fit_intercept=False and:

1. I have got the same problem as before, i.e. when I execute the RFE multiple times I don't get the same ranking each time.
2. When I change the solver to 'sag' (classifier_RFE = LogisticRegression(C=1e9, verbose=1, max_iter=10000, fit_intercept=False, solver='sag')), it seems that I get the same ranking at each run. This is not the case with the 'saga' solver. The ranking is not the same between the solvers.
3. With C=1, it seems that I have the same results at each run for all solvers (liblinear, sag and saga), however the ranking is not the same between the solvers.

How can I get reproducible and consistent results?

Thanks for your help,
Best regards,
Ben

On 24/07/2018 at 18:16, Stuart Reynolds wrote:
liblinear regularizes the intercept (which is a questionable thing to do and a poor choice of default in sklearn). The other solvers do not.
In addition to checking n_iter_ and fixing the random seed as I suggested, maybe also try normalizing the features (e.g. z-scores via the StandardScaler) to see if that stabilizes the training.

Best,
Sebastian
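A sketch of this normalization suggestion: put a StandardScaler in front of RFE inside a Pipeline, so every refit during elimination sees z-scored features. The synthetic data and the injected scale differences are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, n_features=8, random_state=0)
X = X * np.array([1.0, 10.0, 100.0, 1.0, 1.0, 1.0, 1.0, 1000.0])  # uneven scales

pipe = Pipeline([
    ("scale", StandardScaler()),  # z-scores; sag/saga need comparable scales
    ("rfe", RFE(LogisticRegression(solver="sag", max_iter=10000, random_state=0),
                n_features_to_select=1, step=1)),
])
pipe.fit(X, y)
print(pipe.named_steps["rfe"].ranking_)
```

Scaling inside the pipeline also keeps the transformation out of any later cross-validation leakage.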
On 07/24/2018 02:07 PM, Benoît Presles wrote:
How can I get reproducible and consistent results?
Did you scale your data? If not, saga and sag will basically fail.
I did the same tests as before adding random_state=0 and:

1. I have got the same problem as before, i.e. when I execute the RFE multiple times I don't get the same ranking each time.
2. When I change the solver to 'sag' or 'saga' (LogisticRegression(C=1e9, verbose=1, max_iter=10000, fit_intercept=False, random_state=0, solver='sag')), it seems that I get the same results at each run, but the ranking is not the same between these two solvers.
3. With C=1, it seems that I have the same results at each run for all solvers (liblinear, sag and saga), however the ranking is not the same between the solvers.

Thanks for your help,
Ben

PS1: I checked and n_iter_ seems to be always lower than max_iter.
PS2: my data is scaled; I am using "StandardScaler".

On 24/07/2018 at 20:33, Andreas Mueller wrote:
Did you scale your data? If not, saga and sag will basically fail.
Can you share your data or reproduce with synthetic data?

On 07/24/2018 02:43 PM, Benoît Presles wrote:
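One way to follow up on the synthetic-data suggestion: make_classification's n_redundant parameter adds linearly dependent (hence correlated) columns, mimicking an ill-conditioned design. The exact parameters here are assumptions for illustration, not Ben's data.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# 4 of the 10 features are linear combinations of the 3 informative ones
X, y = make_classification(n_samples=300, n_features=10, n_informative=3,
                           n_redundant=4, random_state=0)

rankings = {}
for solver in ("liblinear", "sag", "saga"):
    clf = LogisticRegression(C=1.0, solver=solver, max_iter=10000,
                             random_state=0)
    rankings[solver] = RFE(clf, n_features_to_select=1, step=1).fit(X, y).ranking_
    print(solver, rankings[solver])  # compare rankings across solvers
```

On such data the rankings of the correlated columns can legitimately differ between solvers, since several orderings fit almost equally well.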
On Tue, Jul 24, 2018 at 08:43:27PM +0200, Benoît Presles wrote:
3. With C=1, it seems that I have the same results at each run for all solvers (liblinear, sag and saga), however the ranking is not the same between the solvers.
Your problem is probably ill-conditioned, hence the specific weights on the features are not stable. There isn't a good answer to ordering features; they are degenerate.

In general, I would avoid RFE; it is a hack, and can easily lead to these problems.

Gaël
-- Gael Varoquaux Senior Researcher, INRIA Parietal NeuroSpin/CEA Saclay , Bat 145, 91191 Gif-sur-Yvette France Phone: ++ 33-1-69-08-79-68 http://gael-varoquaux.info http://twitter.com/GaelVaroquaux
So you think that I cannot get reproducible and consistent results with this method? If you would avoid RFE, which method do you suggest to find the best features?

Ben

On 24/07/2018 at 21:34, Gael Varoquaux wrote:
Univariate screening is somewhat hackish too, but much more stable -- and cheap.

Best,
Bertrand

On 24/07/2018 23:33, Benoît Presles wrote:
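The univariate screening Bertrand mentions can be sketched as follows: score each feature on its own (an ANOVA F-test here) instead of refitting a model at every elimination step. The synthetic data is an assumption for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

X, y = make_classification(n_samples=300, n_features=10, n_informative=3,
                           random_state=0)

# Each feature gets an independent F-score; keep the k best.
selector = SelectKBest(score_func=f_classif, k=3).fit(X, y)
print(selector.scores_)                    # one F-score per feature
print(selector.get_support(indices=True))  # indices of the 3 top features
```

Because each score ignores the other features, the result is deterministic across runs, at the price of missing conditional effects.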
Do you think the problems I have can come from correlated features? Indeed, in my dataset I have some highly correlated features. Do you think this could explain why I don't get reproducible and consistent results?

Thanks for your help,
Ben

On 24/07/2018 at 23:44, bthirion wrote:
On Wed, Jul 25, 2018 at 12:36:55PM +0200, Benoît Presles wrote:
Do you think the problems I have can come from correlated features? Indeed, in my dataset I have some highly correlated features.
Yes, in general selecting features conditionally on others is very hard when features are highly correlated.
Do you think this could explain why I don't get reproducible and consistent results?
Yes.
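A small illustration of this point (my own, not from the thread): duplicate a column exactly, and the L2 penalty shares its weight equally between the two copies, so each copy looks less important than the original feature did on its own -- with near-duplicates, how the weight is shared becomes solver- and seed-dependent.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# shuffle=False keeps the informative features in the first columns
X, y = make_classification(n_samples=300, n_features=5, n_informative=3,
                           n_redundant=0, shuffle=False, random_state=0)
X_dup = np.hstack([X, X[:, [0]]])  # column 5 is an exact copy of column 0

clf_a = LogisticRegression(max_iter=10000).fit(X, y)
clf_b = LogisticRegression(max_iter=10000).fit(X_dup, y)
print("alone :", clf_a.coef_[0][0])
print("shared:", clf_b.coef_[0][[0, 5]])  # weight split between twin columns
```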
participants (6)
- Andreas Mueller
- Benoît Presles
- bthirion
- Gael Varoquaux
- Sebastian Raschka
- Stuart Reynolds