[scikit-learn] HashingVectorizer slow in version 0.18

Gabriel Trautmann gabit7 at gmail.com
Tue Oct 11 07:29:20 EDT 2016


Hi,

After upgrading from scikit-learn 0.17 to 0.18, HashingVectorizer is about
10 times slower on the same data.

Before:

scikit-learn 0.17. Numpy 1.11.2. Python 3.5.2 AMD64
Vectorizing 20newsgroup 11314 documents
Vectorization completed in  4.594092130661011  seconds, resulting shape  (11314, 1048576)

After upgrade:

scikit-learn 0.18. Numpy 1.11.2. Python 3.5.2 AMD64
Vectorizing 20newsgroup 11314 documents
Vectorization completed in  43.587692737579346  seconds, resulting shape  (11314, 1048576)


Code:

import time, sklearn, platform, numpy
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import HashingVectorizer

data_train = fetch_20newsgroups(subset='train', shuffle=True, random_state=42)

# Report the library, interpreter, and machine versions used for the timing.
print('scikit-learn {}. Numpy {}. Python {} {}'.format(
    sklearn.__version__, numpy.version.full_version,
    platform.python_version(), platform.machine()))

vectorizer = HashingVectorizer()
print("Vectorizing 20newsgroup {} documents".format(len(data_train.data)))
start = time.time()
data = vectorizer.fit_transform(data_train.data)
print("Vectorization completed in ", time.time() - start,
      ' seconds, resulting shape ', data.shape)
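
The script above times a single run with time.time(), so one noisy run could
distort the comparison. A minimal sketch of a repeated measurement follows,
using only the standard library. The names best_of and workload are
hypothetical and not part of the original script; workload is only a stand-in
so the sketch runs without scikit-learn installed.

```python
import time


def best_of(func, repeats=3):
    """Call func() `repeats` times and return the fastest wall-clock time in seconds."""
    if repeats < 1:
        raise ValueError("repeats must be at least 1")
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()  # monotonic, high-resolution clock
        func()
        timings.append(time.perf_counter() - start)
    return min(timings)


def workload():
    # Hypothetical stand-in for the vectorization call being measured.
    return sum(hash(str(i)) % 1048576 for i in range(10000))


elapsed = best_of(workload, repeats=3)
print("best of 3: {:.6f} seconds".format(elapsed))
```

In the script above, best_of(lambda: vectorizer.fit_transform(data_train.data))
would take the place of the single start/stop measurement. Taking the minimum
over several runs is one common way to reduce the effect of background load.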


Should I submit a bug report?

Thank you,

Gabriel Trautmann