[scikit-learn] Question about dummy coding using DictVectorizer or FeatureHasher: generating correlated dimensions

Mon Nov 6 21:42:45 EST 2017

Hello,

I have a question about dummy coding using DictVectorizer or FeatureHasher.

```
>>> from sklearn.feature_extraction import DictVectorizer, FeatureHasher
>>> D = [{'age': 23, 'gender': 'm'}, {'age': 34, 'gender': 'f'},
...      {'age': 18, 'gender': 'f'}, {'age': 50, 'gender': 'm'}]
>>> m1 = FeatureHasher(n_features=10)
>>> m1.fit_transform(D).toarray()
array([[  0.,   0.,  -1.,   0.,   0.,   0.,   0.,   0.,   0.,  23.],
       [  0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   1.,  34.],
       [  0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   1.,  18.],
       [  0.,   0.,  -1.,   0.,   0.,   0.,   0.,   0.,   0.,  50.]])
>>> m2 = DictVectorizer(sparse=False)
>>> m2.fit_transform(D)
array([[ 23.,   0.,   1.],
       [ 34.,   1.,   0.],
       [ 18.,   1.,   0.],
       [ 50.,   0.,   1.]])
>>> m2.feature_names_
['age', 'gender=f', 'gender=m']
```

Both DictVectorizer and FeatureHasher generate separate dimensions for
'gender=m' and 'gender=f', so these dimensions are perfectly correlated:
each one is fully determined by the other.
This is because, by default, DictVectorizer and FeatureHasher generate n
dimensions for a categorical feature with n values (barring hash
collisions in the case of FeatureHasher).
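
As a quick check of this for the DictVectorizer case, here is a minimal sketch that uses only NumPy on the array printed above (copied by hand, so scikit-learn is not needed to run it). The variable names are only illustrative.

```python
import numpy as np

# DictVectorizer output copied from the session above;
# columns are ['age', 'gender=f', 'gender=m'].
X = np.array([[23., 0., 1.],
              [34., 1., 0.],
              [18., 1., 0.],
              [50., 0., 1.]])

gender_f, gender_m = X[:, 1], X[:, 2]

# Each row has exactly one gender level set, so the two columns sum to 1.
row_sums = gender_f + gender_m

# Pearson correlation between the two dummy columns.
corr = np.corrcoef(gender_f, gender_m)[0, 1]

# With a constant (intercept) column added, the design matrix has
# 4 columns, but the intercept equals gender=f + gender=m.
X_with_intercept = np.column_stack([np.ones(len(X)), X])
rank = np.linalg.matrix_rank(X_with_intercept)

print(row_sums, corr, rank)
```

Here the two dummy columns have a correlation of -1 (up to floating-point rounding), and the matrix with an intercept is rank deficient, which is the collinearity I am asking about.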

My questions are as follows:

1. I expected them to generate n-1 dimensions for n categorical values.
   Is there any way to do this using DictVectorizer and FeatureHasher?
2. How should I handle these correlated dimensions?
   In my understanding, training on data with collinearity makes
   predictions unstable.
   Will L1 or L2 regularization work for this problem?
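
To make question 1 concrete, this is the kind of result I have in mind, sketched as a manual post-processing step on the DictVectorizer output shown above. The helper `drop_first_level` is hypothetical, not part of scikit-learn, and it assumes that categorical columns are exactly those whose name contains DictVectorizer's default '=' separator. It would not carry over to FeatureHasher, which keeps no feature names.

```python
import numpy as np

# Output and feature names copied from the DictVectorizer session above.
feature_names = ['age', 'gender=f', 'gender=m']
X = np.array([[23., 0., 1.],
              [34., 1., 0.],
              [18., 1., 0.],
              [50., 0., 1.]])


def drop_first_level(X, feature_names, sep='='):
    """Drop the first dummy column of each categorical feature.

    Hypothetical helper: a column is treated as categorical if its
    name contains `sep` (an assumption, see the text above).
    """
    seen = set()
    keep = []
    for idx, name in enumerate(feature_names):
        if sep in name:
            feature = name.split(sep, 1)[0]
            if feature not in seen:
                # First level of this feature becomes the reference level.
                seen.add(feature)
                continue
        keep.append(idx)
    return X[:, keep], [feature_names[i] for i in keep]


X_reduced, reduced_names = drop_first_level(X, feature_names)
print(reduced_names)
print(X_reduced)
```

With the data above this keeps 'age' and 'gender=m' and drops 'gender=f', so a categorical feature with n values contributes n-1 columns.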

If there is any issue or article related to these questions,
would you please tell me the URL? Thank you.

Regards,
Yusuke