Explaining pull request #12197 - Fast PolynomialFeatures on CSR matrices
Hi scikit-learn community,

I've made a pull request <https://github.com/scikit-learn/scikit-learn/pull/12197> to add a method that computes polynomial features on compressed sparse row (CSR) matrices directly. It takes advantage of data sparsity by taking products of nonzero features only, resulting in a speedup proportional to the density raised to the power of the degree of the expansion. The method is laid out in this work: <https://arxiv.org/pdf/1803.06418.pdf>.

This yields impressive speedups, as seen in this plot:

[image: blippity.png]

This is for degree = 2, and the effect becomes even more pronounced for degree = 3. The lines for the dense method all overlap because the dense method doesn't leverage sparsity. Also, notice that the CSR algorithm's time approaches the dense algorithm's as the density approaches 1.0. There's even a slight speedup in the fully dense case, which I attribute to the code being in Cython.

One might wonder why I didn't include times for compressed sparse column (CSC) matrices, which PolynomialFeatures does support. The reason is that the CSC path is very, very slow; in fact, it's slower than just passing a dense matrix. I ran a single trial with a 100 x 500 matrix with a density of 0.5. For degree=2, the CSC algorithm took 56.88 seconds while the CSR algorithm took 0.1363 seconds. I tried checking degree=3, but gave up after waiting what seemed like 20 minutes for the CSC method, while the CSR method finished in 35.38 seconds.

The only caveat is that the algorithm requires a function to be derived for each degree. I've done this, as laid out in the paper, for degrees 2 and 3. I don't think people will typically want higher degrees anyway, due to the explosion in the size of the feature space, but I did lay out the pattern for deriving the functions, just in case.

Test results:

[image: Screen Shot 2018-09-28 at 10.34.19 PM.png]
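For anyone who wants to try it, here is a minimal sketch of the intended usage, assuming a scikit-learn build that includes this PR's CSR fast path. The matrix sizes and density are arbitrary illustration values; the output-width formula (1 + n + n*(n+1)/2 columns for degree 2 with the default include_bias=True) follows from counting the bias, linear, and degree-2 terms.

```python
import numpy as np
from scipy import sparse
from sklearn.preprocessing import PolynomialFeatures

# Build a random sparse matrix in CSR format (illustrative sizes).
rng = np.random.RandomState(0)
X = sparse.random(100, 5, density=0.1, format="csr", random_state=rng)

# With CSR input, the expansion only forms products of nonzero entries,
# so the work scales with density**degree rather than with n**degree.
poly = PolynomialFeatures(degree=2)
Xt = poly.fit_transform(X)

# degree=2 on n=5 features with the default bias column gives
# 1 + 5 + 5*6/2 = 21 output columns, and the result stays sparse.
print(Xt.shape)
print(sparse.issparse(Xt))
```

The same call with a dense `X` falls back to the dense code path, which is what the plot above compares against.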
participants (1)
- Andrew Nystrom