[scikit-learn] memory efficient feature extraction

Roman Yurchak rth.yurchak at gmail.com
Mon Jun 6 18:27:57 EDT 2016


Hi Joel,

Thanks for your response.

On 06/06/16 14:29, Joel Nothman wrote:
>      - concatenation of these arrays into a single CSR array appears to be
>     non-trivial given the memory constraints (e.g. scipy.sparse.vstack
>     transforms all arrays to COO sparse representation internally).
> 
> There is a fast path for stacking a series of CSR matrices. 
Could you elaborate a bit more? Does that fast path still apply when the
final array is larger than the available memory?
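
For reference, here is a minimal sketch of what I imagine such an
in-memory fast path would do (assuming all blocks are CSR with the same
number of columns; `fast_csr_vstack` is just an illustrative name):

    import numpy as np
    import scipy.sparse as sp

    def fast_csr_vstack(blocks):
        # Stack CSR matrices without going through COO: concatenate
        # `data` and `indices` directly, and shift each `indptr` by
        # the running count of stored elements.
        data = np.concatenate([b.data for b in blocks])
        indices = np.concatenate([b.indices for b in blocks])
        indptr_parts = [blocks[0].indptr]
        offset = blocks[0].indptr[-1]
        for b in blocks[1:]:
            indptr_parts.append(b.indptr[1:] + offset)
            offset += b.indptr[-1]
        indptr = np.concatenate(indptr_parts)
        n_rows = sum(b.shape[0] for b in blocks)
        return sp.csr_matrix((data, indices, indptr),
                             shape=(n_rows, blocks[0].shape[1]))

This still materializes the concatenated arrays in memory, though,
which is exactly the constraint here.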

Do you mean something along the lines of,

  1. Load all arrays of the series as memory maps, and compute the
expected shape of the final array
  2. Allocate the `data`, `indices` and `indptr` arrays on disk, using
either a numpy memory map or HDF5
  3. Recompute `indptr` for each array in the series and fill the three
resulting arrays (see the sketch after this list)
  4. Make sure that these files can be opened as a scipy CSR matrix,
with the ability to load only a subset of rows into memory?
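
Concretely, steps 2 and 3 could look something like the sketch below
(`concatenate_csr_chunks` and the file naming are hypothetical; the
chunks are assumed to be CSR matrices whose `data`/`indices`/`indptr`
were saved to disk and reopened with np.load(..., mmap_mode='r')):

    import numpy as np

    def concatenate_csr_chunks(chunks, prefix):
        # Pass over the memory-mapped chunks to get the final shape
        # and the total number of stored elements (step 1).
        n_rows = sum(c.shape[0] for c in chunks)
        nnz = sum(c.nnz for c in chunks)

        # Allocate the three CSR arrays on disk (step 2).
        data = np.memmap(prefix + '.data', mode='w+',
                         dtype=chunks[0].data.dtype, shape=(nnz,))
        indices = np.memmap(prefix + '.indices', mode='w+',
                            dtype=np.int64, shape=(nnz,))
        indptr = np.memmap(prefix + '.indptr', mode='w+',
                           dtype=np.int64, shape=(n_rows + 1,))

        # Fill them chunk by chunk, shifting each indptr by the
        # number of non-zeros written so far (step 3).
        indptr[0] = 0
        row = pos = 0
        for c in chunks:
            data[pos:pos + c.nnz] = c.data
            indices[pos:pos + c.nnz] = c.indices
            indptr[row + 1:row + c.shape[0] + 1] = c.indptr[1:] + pos
            row += c.shape[0]
            pos += c.nnz
        return n_rows, chunks[0].shape[1]

Step 4 would then amount to re-opening these memmaps and slicing
`indptr` to pull out only the rows of interest.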

I'm just wondering whether there is a more standard storage solution in
the scikit-learn ecosystem that could be used efficiently with a
stateless feature extractor such as HashingVectorizer.
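
For concreteness, the extraction side I have in mind is roughly the
following (the chunking and file layout are just illustrative):

    import numpy as np
    from sklearn.feature_extraction.text import HashingVectorizer

    vec = HashingVectorizer(n_features=2 ** 20)

    # `document_chunks` is a hypothetical iterable of lists of strings.
    # HashingVectorizer is stateless, so each chunk can be transformed
    # independently, with no fit step and no shared vocabulary.
    for i, chunk in enumerate(document_chunks):
        X = vec.transform(chunk)  # one CSR matrix per chunk
        np.save('chunk_%d.data.npy' % i, X.data)
        np.save('chunk_%d.indices.npy' % i, X.indices)
        np.save('chunk_%d.indptr.npy' % i, X.indptr)

These per-chunk files are what would then be memory-mapped and
concatenated as in the sketch above.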

Cheers,
-- 
Roman

