[scikit-learn] Adding BM25 to sklearn.feature_extraction.text (Update)
Basil Beirouti
basilbeirouti at gmail.com
Thu Jun 30 18:23:18 EDT 2016
Hello everyone,
I have written a few versions of a BM25Transformer. I looked at
TfidfTransformer for guidance and noticed that it outputs a sparse matrix
when given a sparse term-count matrix as input.
Unfortunately, the fastest implementation of BM25Transformer that I have
been able to come up with does NOT output a sparse matrix; it returns a
regular (dense) numpy matrix.
Benchmarked against the entire 20newsgroups corpus, here is how the three
versions perform (assuming the input is a csr_matrix in every case):
1.) finishes in 4 seconds, outputs a regular numpy matrix
2.) finishes in 30 seconds, outputs a dok_matrix
3.) finishes in 130 seconds, outputs a regular numpy matrix
It's worth noting that running algorithm 1 and then converting its output to
a sparse matrix still takes less time than algorithm 3, and about as long as
algorithm 2.
So my question is, how important is it that my BM25Transformer outputs a
sparse matrix?
I'm going to try another implementation that works directly on the data,
indices, and indptr attributes of the input csr_matrix. I just wanted to
check in and see what people think.
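
To make that last idea concrete, here is a minimal sketch of what working
directly on data, indices, and indptr could look like. It is not one of the
three versions benchmarked above. The function name bm25_transform, the
defaults k1=1.2 and b=0.75, and the smoothed IDF log(1 + (N - df + 0.5) /
(df + 0.5)) are all assumptions made for illustration; other BM25 variants
use a different IDF.

```python
import numpy as np
import scipy.sparse as sp


def bm25_transform(X, k1=1.2, b=0.75):
    """Hypothetical helper: BM25-weight a term-count matrix, staying sparse.

    X is a (documents x terms) count matrix. k1, b, and the IDF formula
    below are illustrative assumptions, not a fixed API.
    """
    # Work on a float CSR copy so the caller's matrix is left untouched.
    X = sp.csr_matrix(X, dtype=np.float64, copy=True)
    X.eliminate_zeros()
    n_docs, n_terms = X.shape

    # Document lengths (row sums) and the average document length.
    doc_len = np.asarray(X.sum(axis=1)).ravel()
    avg_len = doc_len.mean()

    # Document frequency: how many stored entries fall in each column.
    df = np.bincount(X.indices, minlength=n_terms)
    idf = np.log1p((n_docs - df + 0.5) / (df + 0.5))

    # Row index of every stored entry, recovered from indptr.
    rows = np.repeat(np.arange(n_docs), np.diff(X.indptr))

    # Rewrite only the stored values; the sparsity pattern is unchanged.
    tf = X.data
    length_norm = k1 * (1.0 - b + b * doc_len[rows] / avg_len)
    X.data = tf * (k1 + 1.0) / (tf + length_norm) * idf[X.indices]
    return X
```

Because the BM25 weight of a zero count is zero, only the stored entries need
to be touched, so the output has the same sparsity pattern as the input and no
dense intermediate is ever built.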
Sincerely,
Basil Beirouti