[scikit-learn] Explaining pull request #12197 - Fast PolynomialFeatures on CSR matrices

Andrew Nystrom awnystrom at gmail.com
Sat Sep 29 02:34:38 EDT 2018

Hi scikit-learn community.

I've making a pull request
<https://github.com/scikit-learn/scikit-learn/pull/12197> to add a method
that allows for polynomial features to be computed on compressed sparse row
(CSR) matrices directly. It takes advantage of data sparsity, only taking
products of nonzero features, resulting in a speedup proportional to the
density to the power of the degree of the expansion. The method is laid out
in this work <https://arxiv.org/pdf/1803.06418.pdf>. This yields an
impressive speedups, as seen in this plot:
[image: blippity.png]
This is for degree = 2, and the effects become even more pronounced for
degree = 3.

All the lines for the dense method are overlapping each other. This is
because the dense method doesn't leverage sparsity. Also, notice that the
CSR algorithm approaches the time of the dense algorithm as the density
approaches 1.0. There's even a slight speedup for the fully dense case,
which I attribute to the fact that the code's in Cython.

One might wonder why I didn't include the times for compressed sparse
column (CSC) matrices, which PolynomialFeatures does support. The reason is
that this is very, very slow. In fact, it's slower than just passing a
dense matrix. I ran a single trial with a 100 x 500 matrix with a density
of 0.5. For degree=2, the CSC algorithm took 56.88 seconds while the CSR
algorithm took 0.1363. I tried checking for degree=3, but gave up after
waiting for what seemed like 20 minutes for the CSC method, but the CSR
method finished in 35.38 seconds.

The only caveat is that the algorithm requires a function to be derived for
each degree. I've done this, as laid out in the paper, for degrees 2 and 3.
I don't think people will typically even want higher degrees due to the
explosion of the size of the feature space, but I did lay out the pattern
for deriving the functions, just in case.

Test results:
[image: Screen Shot 2018-09-28 at 10.34.19 PM.png]
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/scikit-learn/attachments/20180928/2cc5148f/attachment-0001.html>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: blippity.png
Type: image/png
Size: 65149 bytes
Desc: not available
URL: <http://mail.python.org/pipermail/scikit-learn/attachments/20180928/2cc5148f/attachment-0002.png>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: Screen Shot 2018-09-28 at 10.34.19 PM.png
Type: image/png
Size: 503007 bytes
Desc: not available
URL: <http://mail.python.org/pipermail/scikit-learn/attachments/20180928/2cc5148f/attachment-0003.png>

More information about the scikit-learn mailing list