Hi, I was wondering what the most (memory) efficient way of combining two sparse matrices would be. I am constructing a very large sparse matrix, but due to the temporary memory required to calculate the entries I am doing it in blocks, with the computation of each block done in a forked child process. This returns a sparse matrix of the same dimensions as the full one, but with a smaller number of entries. I would like to add the entries from the block result to the 'master' copy. I can be sure that there will be no overlap in the position of entries (ie no matrix position will be in both sides). What is the most memory efficient way of combining these? I noticed += isn't implemented, but it's not clear how that would work anyway. The best I have done so far is adding two lil_matices (the block is created as an lil-matrix for fancy indexing) A = A + Apartial, but as the master copy grows this means I think that I will need double the final memory requirement for A (to add the last block). Is there a better way of doing this? Also, what are the memory requirements for the conversions (.tocsc, .tocsr etc.)? Will that mean I need double the memory anyway? Thanks, Robin
On Sat, May 3, 2008 at 12:07 PM, Robin <robince@gmail.com> wrote:
Hi,
I was wondering what the most (memory) efficient way of combining two sparse matrices would be.
I am constructing a very large sparse matrix, but due to the temporary memory required to calculate the entries I am doing it in blocks, with the computation of each block done in a forked child process. This returns a sparse matrix of the same dimensions as the full one, but with a smaller number of entries. I would like to add the entries from the block result to the 'master' copy. I can be sure that there will be no overlap in the position of entries (ie no matrix position will be in both sides).
What is the most memory efficient way of combining these? I noticed += isn't implemented, but it's not clear how that would work anyway. The best I have done so far is adding two lil_matices (the block is created as an lil-matrix for fancy indexing) A = A + Apartial, but as the master copy grows this means I think that I will need double the final memory requirement for A (to add the last block). Is there a better way of doing this?
Also, what are the memory requirements for the conversions (.tocsc, .tocsr etc.)? Will that mean I need double the memory anyway?
After reading some more about the different sparse matrix types I am now trying having both master and block matrices as dok, passing back dict(Apartial) since cPickle can't pickle dok matrices, and then doing A.update(Apartial). This seems to be a bit slower (which is OK) but also the memory used seems to be growing a little fast - it seems as though something is not being released (I do del Apartial after the update but I have a feeling it might be sticking around) and I'm worried the machine will fall over again before it's finished. I'd still appreciate some advice as to the best way to do this. I thought of directly appending to the lists in the lil_matrix, but I would then have to sort them again and I wasn't sure if the object array could take resizing like this. Robin
On Sat, May 3, 2008 at 10:16 AM, Robin <robince@gmail.com> wrote:
I was wondering what the most (memory) efficient way of combining two sparse matrices would be.
I'd still appreciate some advice as to the best way to do this. I thought of directly appending to the lists in the lil_matrix, but I would then have to sort them again and I wasn't sure if the object array could take resizing like this.
If you're worried about speed or memory consumption, then you should avoid lil_matrix and dok_matrix. The fastest and most memory efficient approach uses coo_matrix: row = array([2,3,1,4]) col = array([4,2,3,1]) data = array([5,5,5,5]) A = coo_matrix( (data,(row,col)), shape=(5,5)).tocsr() The conversion to CSR forces a copy, however CSR and COO are so much more memory efficient than LIL/DOK that it shouldn't matter. -- Nathan Bell wnbell@gmail.com http://graphics.cs.uiuc.edu/~wnbell/
On Sat, May 3, 2008 at 4:24 PM, Nathan Bell <wnbell@gmail.com> wrote:
If you're worried about speed or memory consumption, then you should avoid lil_matrix and dok_matrix.
The fastest and most memory efficient approach uses coo_matrix:
row = array([2,3,1,4]) col = array([4,2,3,1]) data = array([5,5,5,5]) A = coo_matrix( (data,(row,col)), shape=(5,5)).tocsr()
The conversion to CSR forces a copy, however CSR and COO are so much more memory efficient than LIL/DOK that it shouldn't matter.
Thanks - I was just thinking about COO matrices. The problem is I have to build it up in parts - I know lists are more efficient to extend than arrays so I could store the data, row, col as lists and extend with each chunk - then I think I can construct the COO matrix from the list. I was worried that the lists would stay around though - even if deleted so then I wouldn't have enough memory to do the COO-CSC conversion. Perhaps I could add another layer of indirection with another fork() though to make sure the lists are cleaned up and just pass back the COO matrix - hopefully the COO matrix will pickle OK. Thanks, Robin
On 3-May-08, at 11:47 AM, Robin wrote:
Perhaps I could add another layer of indirection with another fork() though to make sure the lists are cleaned up and just pass back the COO matrix - hopefully the COO matrix will pickle OK.
You probably don't want to use pickle, both save and load will be pretty slow. I would call .tofile() on each of yourmatrix.row, yourmatrix.col and yourmatrix.data. Cheers, David
2008/5/3 Robin <robince@gmail.com>:
Hi,
I was wondering what the most (memory) efficient way of combining two sparse matrices would be.
I am constructing a very large sparse matrix, but due to the temporary memory required to calculate the entries I am doing it in blocks, with the computation of each block done in a forked child process. This returns a sparse matrix of the same dimensions as the full one, but with a smaller number of entries. I would like to add the entries from the block result to the 'master' copy. I can be sure that there will be no overlap in the position of entries (ie no matrix position will be in both sides).
What is the most memory efficient way of combining these? I noticed += isn't implemented, but it's not clear how that would work anyway. The best I have done so far is adding two lil_matices (the block is created as an lil-matrix for fancy indexing) A = A + Apartial, but as the master copy grows this means I think that I will need double the final memory requirement for A (to add the last block). Is there a better way of doing this?
Also, what are the memory requirements for the conversions (.tocsc, .tocsr etc.)? Will that mean I need double the memory anyway?
Sparse matrices, in any of the formats implemented in numpy/scipy, are quite inefficient when dealing with subblocks. Every single entry in the block must have at the least a four-byte index stored alongside it. This need not necessarily be a problem, but it seems to be for you. Perhaps you would be better off working with the blocks, represented as dense arrays, directly? Or if the direct approach is too cumbersome, you could write a quick(*) subclass that lets you index the block as if it were part of a larger matrix. (*) Actually I'm not sure just how quick it would be, but overriding __getitem__ and __setitem__ plus the usual initialization stuff ought to do the job. Anne
On Sat, May 3, 2008 at 4:21 PM, Anne Archibald <peridot.faceted@gmail.com> wrote:
Sparse matrices, in any of the formats implemented in numpy/scipy, are quite inefficient when dealing with subblocks. Every single entry in the block must have at the least a four-byte index stored alongside it. This need not necessarily be a problem, but it seems to be for you.
FWIW there's now a BSR (block CSR) format in scipy.sparse for matrices with dense sub-blocks. -- Nathan Bell wnbell@gmail.com http://graphics.cs.uiuc.edu/~wnbell/
participants (4)
-
Anne Archibald -
David Warde-Farley -
Nathan Bell -
Robin