Why does scikit-learn's HashingVectorizer give negative values?
Hi, I'm trying to make HashingVectorizer work for online learning. To do this, I need it to return actual token counts. The HashingVectorizer in scikit-learn doesn't give token counts; by default it returns counts normalized to unit l1 or l2 norm. Since I need the raw counts, I set norm=None. After doing that I no longer get fractional values, but I still get negative numbers. It seems the negatives can be removed by setting non_negative=True, which takes the absolute value of the entries, but I don't understand why the negatives are there in the first place, or what they mean, and I'm not sure whether the absolute values correspond to the token counts.

Can someone please help explain what the HashingVectorizer is doing? How do I get the HashingVectorizer to return token counts?

You can replicate my results with the following code, using the 20newsgroups dataset that ships with scikit-learn:

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import HashingVectorizer

twenty_train = fetch_20newsgroups(subset='train', shuffle=True, random_state=42)

# default settings: l2-normalized rows, fractional values both positive and negative
cv = HashingVectorizer(stop_words='english')
X_train = cv.fit_transform(twenty_train.data)
print(X_train)

# norm=None: integer-valued results, both positive and negative
cv = HashingVectorizer(stop_words='english', norm=None)
X_train = cv.fit_transform(twenty_train.data)
print(X_train)

# non_negative=True: only positive values, but not sure they correspond to counts
cv = HashingVectorizer(stop_words='english', norm=None, non_negative=True)
X_train = cv.fit_transform(twenty_train.data)
print(X_train)
On 01/10/16 15:34, Moyi Dang wrote:
However, I don't understand why the negatives are there in the first place, or what they mean. I'm not sure whether the absolute values correspond to the token counts.
Can someone please help explain what the HashingVectorizer is doing? How do I get the HashingVectorizer to return token counts?
Hi Moyi,

It's a mechanism to compensate for hash collisions, see https://github.com/scikit-learn/scikit-learn/issues/7513

The absolute values are the token counts for most practical applications (as long as you don't have too many collisions). There will be a PR shortly to make this more consistent.
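For illustration, a quick sanity check (a toy example of my own, assuming a hash space large enough that collisions are negligible; note that in later scikit-learn releases the non_negative option was replaced by alternate_sign=False):

from sklearn.feature_extraction.text import CountVectorizer, HashingVectorizer

docs = ["the cat sat on the mat", "the dog sat"]

# exact token counts from CountVectorizer
counts = CountVectorizer().fit_transform(docs)
print(counts.sum())   # total number of tokens counted

# hashed "counts": with norm=None and non_negative=True the absolute
# values equal the true counts unless two tokens share a bucket
hv = HashingVectorizer(norm=None, non_negative=True, n_features=2**20)
hashed = hv.transform(docs)
print(hashed.sum())   # should match the total above if there are no collisions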
The negative values are not really there to compensate for hash collisions. They are there because the alternating sign makes the hashed vector space an approximation of the full vector space under the inner product.
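A rough toy sketch of what that means in practice (my own illustration, not from the linked issue): compare inner products between two documents in the exact count space and in the hashed space, keeping the signed values unnormalized.

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, HashingVectorizer

docs = ["machine learning with python",
        "python machine learning tools"]

# exact inner product in the full count space
counts = CountVectorizer().fit_transform(docs).toarray()
print(np.dot(counts[0], counts[1]))

# inner product in the hashed space (signed values, no normalization);
# colliding tokens get random +1/-1 signs, so the errors tend to cancel
# and the hashed inner product stays close to the exact one
hashed = HashingVectorizer(norm=None, n_features=2**10).transform(docs).toarray()
print(np.dot(hashed[0], hashed[1]))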
participants (3)
- Joel Nothman
- Moyi Dang
- Roman Yurchak