[scikit-learn] HashingVectorizer slow in version 0.18
Gabriel Trautmann
gabit7 at gmail.com
Tue Oct 11 07:29:20 EDT 2016
Hi,
After upgrading from scikit-learn 0.17 to 0.18, HashingVectorizer is about
10 times slower on the example below.
Before:
scikit-learn 0.17. Numpy 1.11.2. Python 3.5.2 AMD64
Vectorizing 20newsgroup 11314 documents
Vectorization completed in 4.594092130661011 seconds, resulting shape
(11314, 1048576)
After upgrade:
scikit-learn 0.18. Numpy 1.11.2. Python 3.5.2 AMD64
Vectorizing 20newsgroup 11314 documents
Vectorization completed in 43.587692737579346 seconds, resulting shape
(11314, 1048576)
Code:
import time, sklearn, platform, numpy
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import HashingVectorizer

# Load the 20 newsgroups training set (downloaded on first use)
data_train = fetch_20newsgroups(subset='train', shuffle=True, random_state=42)
# Report the environment so the two runs can be compared
print('scikit-learn {}. Numpy {}. Python {} {}'.format(
    sklearn.__version__, numpy.version.full_version,
    platform.python_version(), platform.machine()))
vectorizer = HashingVectorizer()
print("Vectorizing 20newsgroup {} documents".format(len(data_train.data)))
# Time only the vectorization step
start = time.time()
data = vectorizer.fit_transform(data_train.data)
print("Vectorization completed in ", time.time() - start,
      ' seconds, resulting shape ', data.shape)
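The numbers above come from a single timed run each. To rule out one-off
effects such as a cold disk cache, the measurement could be repeated and the
best time kept. This is only a minimal sketch: best_of is a hypothetical
helper name, not part of scikit-learn, and the workload here is a stand-in so
that the snippet runs on its own.

```python
import time

def best_of(func, repeats=3):
    """Call func() `repeats` times; return (best_seconds, last_result).

    best_of is a hypothetical helper for this post, not a library function.
    """
    best = float("inf")
    result = None
    for _ in range(repeats):
        start = time.perf_counter()  # monotonic, high-resolution clock
        result = func()
        best = min(best, time.perf_counter() - start)
    return best, result

# Stand-in workload so the sketch runs without scikit-learn installed
seconds, total = best_of(lambda: sum(range(100000)))
print("best of 3: {:.6f} seconds, result {}".format(seconds, total))
```

In the script above, the stand-in lambda would be replaced with
lambda: vectorizer.fit_transform(data_train.data), run once under each
scikit-learn version.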
Should I submit a bug report?
Thank you,
Gabriel Trautmann