Memory-efficient feature extraction
Dear all,

I was wondering if somebody could advise on the best way to generate and store large sparse feature sets that do not fit in memory? In particular, I have the following workflow:

Large text dataset -> HashingVectorizer -> Feature set in a sparse CSR array on disk -> Training a classifier -> Predictions

where the generated feature set is too large to fit in RAM; however, the classifier training can be done in one step (as it uses only certain rows of the CSR array), and the prediction can be split into several steps, each of which fits in memory. Since the training can be performed in one step, I'm not looking for incremental, out-of-core learning approaches, and saving features to disk for later processing is definitely useful.

For instance, if it were possible to save the output of the HashingVectorizer to a single file on disk (using e.g. joblib.dump) and then load this file as a memory map (using e.g. joblib.load(.., mmap_mode='r')), everything would work great. Due to memory constraints this cannot be done directly, and the best-case scenario is applying HashingVectorizer to chunks of the dataset, which produces a series of sparse CSR arrays on disk. Then,

- concatenation of these arrays into a single CSR array appears to be non-trivial given the memory constraints (e.g. scipy.sparse.vstack transforms all arrays to COO sparse representation internally);
- I was not able to find an abstraction layer that would represent these sparse arrays as a single array. For instance, dask can do this for dense arrays ( http://dask.pydata.org/en/latest/array-stack.html ), but support for sparse arrays is only planned at this point ( https://github.com/dask/dask/issues/174 ).

Finally, it is not possible to pre-allocate the full array on disk in advance (and access it as a memory map), because we don't know the number of non-zero elements in the sparse array before running the feature extraction.

Of course, it is possible to overcome all these difficulties by using a machine with more memory, but my point is rather to have a memory-efficient workflow. I would really appreciate any advice on this, and would be happy to contribute to a project in the scikit-learn environment aiming to address similar issues.

Thank you,
Best,
--
Roman
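For what it's worth, the chunked vectorization step described above can be sketched roughly as follows. This is only an illustrative sketch: the chunk size, file-name scheme, and helper function name are assumptions, not an established recipe.

```python
# Sketch: vectorize a large corpus in manageable chunks and dump each
# chunk's CSR feature matrix to disk with joblib.
# The chunk size and file names below are illustrative assumptions.
import joblib
from sklearn.feature_extraction.text import HashingVectorizer

def vectorize_in_chunks(texts, chunk_size=10000, prefix="features"):
    vect = HashingVectorizer()  # stateless: no fitting pass over the data needed
    paths = []
    for start in range(0, len(texts), chunk_size):
        # Each chunk's feature matrix is small enough to fit in RAM.
        X_chunk = vect.transform(texts[start:start + chunk_size])
        path = "%s_%d.joblib" % (prefix, start)
        joblib.dump(X_chunk, path)
        paths.append(path)
    return paths
```

Because HashingVectorizer is stateless, the chunks are independent and could even be produced in parallel; the open question in this thread is what to do with the resulting series of CSR files.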
On 6 June 2016 at 22:19, Roman Yurchak <rth.yurchak@gmail.com> wrote:

> - concatenation of these arrays into a single CSR array appears to be non-trivial given the memory constraints (e.g. scipy.sparse.vstack transforms all arrays to COO sparse representation internally).

There is a fast path for stacking a series of CSR matrices.

_______________________________________________
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn
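The "fast path" idea Joel mentions, stacking CSR matrices by concatenating their underlying data and indices arrays and shifting each indptr, without a round-trip through COO, can be sketched as follows. This is an illustrative reimplementation for clarity, not scipy's actual internal code.

```python
# Sketch: stack CSR matrices (same number of columns) by working directly
# on their data / indices / indptr arrays, avoiding a COO round-trip.
# Illustrative reimplementation, not scipy's internal fast path.
import numpy as np
import scipy.sparse as sp

def csr_vstack(blocks):
    data = np.concatenate([b.data for b in blocks])
    indices = np.concatenate([b.indices for b in blocks])
    # Each block's indptr must be shifted by the running non-zero count.
    parts = [blocks[0].indptr]
    nnz = blocks[0].nnz
    for b in blocks[1:]:
        parts.append(b.indptr[1:] + nnz)
        nnz += b.nnz
    indptr = np.concatenate(parts)
    n_rows = sum(b.shape[0] for b in blocks)
    return sp.csr_matrix((data, indices, indptr),
                         shape=(n_rows, blocks[0].shape[1]))
```

The key observation is that only indptr needs adjusting; data and indices can be concatenated as-is, in order.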
Hi Joel, thanks for your response.

On 06/06/16 14:29, Joel Nothman wrote:
> > - concatenation of these arrays into a single CSR array appears to be non-trivial given the memory constraints (e.g. scipy.sparse.vstack transforms all arrays to COO sparse representation internally).
>
> There is a fast path for stacking a series of CSR matrices.
Could you elaborate a bit more? When the final array is larger than the available memory? Do you mean something along the lines of:

1. Load all arrays of the series as memory maps, and calculate the expected final array shape.
2. Allocate the `data`, `indices` and `indptr` arrays on disk, using either a numpy memory map or HDF5.
3. Recalculate `indptr` for each array in the series and fill the three resulting arrays.
4. Make sure that we can open these files as a scipy CSR array, with the ability to load only a subset of rows into memory?

I'm just wondering if there is a more standard storage solution in the scikit-learn environment that could be used efficiently with a stateless feature extractor (HashingVectorizer).

Cheers,
--
Roman
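Steps 1-3 above could be sketched with numpy memory maps along these lines. The function name, file layout, and dtype choices are assumptions for illustration (raw memmap files rather than HDF5); it is a sketch of the idea, not a vetted implementation.

```python
# Sketch of steps 1-3: concatenate joblib-dumped CSR chunks into three
# pre-allocated memory-mapped arrays on disk, without ever materializing
# the full matrix in RAM. File names and dtypes are illustrative.
import numpy as np
import joblib

def concatenate_csr_chunks(chunk_files, out_prefix):
    # Step 1: open every chunk as a memory map; compute final shape and nnz.
    chunks = [joblib.load(f, mmap_mode="r") for f in chunk_files]
    n_rows = sum(c.shape[0] for c in chunks)
    n_cols = chunks[0].shape[1]
    nnz = sum(int(c.nnz) for c in chunks)
    # Step 2: pre-allocate data / indices / indptr as raw memory-mapped files.
    data = np.memmap(out_prefix + "_data.dat", dtype=chunks[0].data.dtype,
                     mode="w+", shape=(nnz,))
    indices = np.memmap(out_prefix + "_indices.dat", dtype=np.int64,
                        mode="w+", shape=(nnz,))
    indptr = np.memmap(out_prefix + "_indptr.dat", dtype=np.int64,
                       mode="w+", shape=(n_rows + 1,))
    # Step 3: copy each chunk, shifting its indptr by the running nnz offset.
    offset, row = 0, 0
    indptr[0] = 0
    for c in chunks:
        data[offset:offset + c.nnz] = c.data
        indices[offset:offset + c.nnz] = c.indices
        indptr[row + 1:row + 1 + c.shape[0]] = c.indptr[1:].astype(np.int64) + offset
        offset += int(c.nnz)
        row += c.shape[0]
    for arr in (data, indices, indptr):
        arr.flush()
    return n_rows, n_cols
```

For step 4, the three files can later be reopened read-only with np.memmap and passed to scipy.sparse.csr_matrix((data, indices, indptr), shape=...); slicing rows then only touches the relevant regions of data and indices on disk.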
Participants (2):
- Joel Nothman
- Roman Yurchak