[scikit-learn] LogisticRegression coef_ greater than n_features?

Tue Jan 8 20:07:03 EST 2019

The column order follows the sorted category values (stored in ohe.categories_ after fitting), not the order in which they occur in the training set. E.g.,

from sklearn.preprocessing import OneHotEncoder
import numpy as np

x = np.array([['b'],
              ['a'], 
              ['b']])
ohe = OneHotEncoder()
xt = ohe.fit_transform(x)
xt.todense()

matrix([[0., 1.],
        [1., 0.],
        [0., 1.]])

and

x = np.array([['a'],
              ['b'], 
              ['a']])
ohe = OneHotEncoder()
xt = ohe.fit_transform(x)
xt.todense()

matrix([[1., 0.],
        [0., 1.],
        [1., 0.]])
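
To see which coefficient belongs to which encoded column, you can pair coef_ with ohe.categories_. Here is a minimal sketch of that; the labels in y are made up for illustration and are not from your data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder

x = np.array([['b'], ['a'], ['b'], ['a']])
# Hypothetical string labels, only for illustration
y = np.array(['Southwest', 'American', 'Southwest', 'American'])

ohe = OneHotEncoder()
xt = ohe.fit_transform(x)

clf = LogisticRegression(solver='lbfgs').fit(xt, y)

# categories_ holds one sorted array per input column;
# the encoded columns appear in that same order
labels = [str(c) for c in ohe.categories_[0]]
coef_by_column = dict(zip(labels, clf.coef_[0]))

print(clf.classes_)     # coef_ refers to the second class listed here
print(coef_by_column)
```

For a binary problem, coef_ has shape (1, number of encoded columns), and a positive coefficient pushes the prediction toward clf.classes_[1].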

Not sure how you used the OHE, but make sure you apply it only to the columns that are actually categorical. Otherwise every distinct numeric value becomes its own column, e.g.:

x = np.array([['a', 1.1],
              ['b', 1.2], 
              ['a', 1.3]])
ohe = OneHotEncoder()
xt = ohe.fit_transform(x)
xt.todense()

matrix([[1., 0., 1., 0., 0.],
        [0., 1., 0., 1., 0.],
        [1., 0., 0., 0., 1.]])
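
One way to restrict the encoder to the categorical column is ColumnTransformer (available since 0.20). This is only a sketch under the assumption that column 0 is the categorical one; your pipeline may select columns differently:

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# dtype=object keeps the numeric column as floats instead of strings
x = np.array([['a', 1.1],
              ['b', 1.2],
              ['a', 1.3]], dtype=object)

# Encode only column 0; pass the remaining column through unchanged.
# sparse_threshold=0 forces a dense array as output.
ct = ColumnTransformer([('cat', OneHotEncoder(), [0])],
                       remainder='passthrough',
                       sparse_threshold=0)
xt = ct.fit_transform(x)
print(xt.astype(float))
```

This should give three columns (two for 'a'/'b' plus the untouched numeric column) instead of five.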

Best,
Sebastian

> On Jan 8, 2019, at 9:33 AM, pisymbol <pisymbol at gmail.com> wrote:
> 
> Also Sebastian, I have binary classes but they are strings:
> 
> clf.classes_:
> array(['American', 'Southwest'], dtype=object)
> 
> 
> 
> On Tue, Jan 8, 2019 at 9:51 AM pisymbol <pisymbol at gmail.com> wrote:
> If that is the case, what order are the coefficients in then?
> 
> -aps
> 
> On Tue, Jan 8, 2019 at 12:48 AM Sebastian Raschka <mail at sebastianraschka.com> wrote:
> E.g., if you have a feature with values 'a', 'b', 'c', then applying the one-hot encoder will transform this into 3 features.
> 
> Best,
> Sebastian
> 
> > On Jan 7, 2019, at 11:02 PM, pisymbol <pisymbol at gmail.com> wrote:
> > 
> > 
> > 
> > On Mon, Jan 7, 2019 at 11:50 PM pisymbol <pisymbol at gmail.com> wrote:
> > According to the doc (0.20.2), coef_ is supposed to have shape (1, n_features) for binary classification. Well, I created a Pipeline and performed a GridSearchCV to create a LogisticRegression model that does fairly well. However, when I wanted to rank feature importance, I noticed that the coef_ of my best_estimator_ has 24 entries while my training data has 22 features.
> > 
> > What am I missing? How could coef_ have more entries than n_features?
> > 
> > 
> > Just a follow-up: I am using a OneHotEncoder to encode two categoricals as part of my pipeline (I am also using an imputer and a standard scaler, but I don't see how those could add features).
> > 
> > Could my pipeline actually add two more features during fitting?
> > 
> > -aps
> > _______________________________________________
> > scikit-learn mailing list
> > scikit-learn at python.org
> > https://mail.python.org/mailman/listinfo/scikit-learn
> 