[SciPy-user] scipy.sparse: coo_matrix ignores sum_duplicates=False

Nathan Bell wnbell at gmail.com
Mon Oct 13 22:57:15 EDT 2008


On Mon, Oct 13, 2008 at 6:33 PM, James Philbin <philbinj at gmail.com> wrote:
>
> same. I think dok_matrix is fine for my needs. BTW, I've found that
> __setitem__ is very slow for dok_matrix. Is this just because of the
> checks which are made? Using dict.__setitem__(mat, (r,c), val) is
> about an order of magnitude faster.

I don't use dok_matrix, so I don't know why it would be that much
slower.  If you can speed it up and submit a patch I'd happily apply
it.

> I'm not arguing that summing duplicate entries is not desirable. I'm
> just arguing that a function which reads .tocsr(sum_duplicates=False)
> and then sums the duplicates implicitly is misnamed.

Please understand, it *does not* sum the duplicates.  As I illustrated
before, the duplicates are carried over to the CSR format.  It's just
that CSR->dense *does* sum duplicates.
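
To make the distinction concrete, here is a rough pure-Python sketch of
the idea.  It is not the scipy.sparse code: csr_to_dense is a
hypothetical helper written only for this illustration, and the three
lists stand in for the usual CSR arrays.  The duplicate at (0, 0) is
still stored twice in the CSR arrays; only the expansion to dense
merges it, because it accumulates with +=.

```python
# Pure-Python sketch of CSR storage; NOT SciPy's actual implementation.
# A 2x2 matrix whose entry (0, 0) is stored twice (values 1.0 and 2.0).
data = [1.0, 2.0, 5.0]
indices = [0, 0, 1]   # column index of each stored value
indptr = [0, 2, 3]    # row i owns data[indptr[i]:indptr[i + 1]]


def csr_to_dense(data, indices, indptr, n_cols):
    """Expand CSR arrays to a list of rows, accumulating repeated entries."""
    dense = []
    for i in range(len(indptr) - 1):
        row = [0.0] * n_cols
        for k in range(indptr[i], indptr[i + 1]):
            row[indices[k]] += data[k]  # += is what merges the duplicates
        dense.append(row)
    return dense


stored_entries = len(data)  # duplicates are still present in the CSR arrays
dense = csr_to_dense(data, indices, indptr, n_cols=2)

print(stored_entries)  # 3
print(dense)           # [[3.0, 0.0], [0.0, 5.0]]
```

Replacing += with plain assignment in the inner loop would give
"last entry wins" behaviour instead, which is one way to see that the
summing belongs to the dense conversion and not to the CSR storage.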

I agree that sum_duplicates=False is somewhat ambiguous.  Do you have a
suggestion for how this could be made clearer?  For instance, would
an interface like:
  coo_matrix.tocsr(duplicates='sum')
  coo_matrix.tocsr(duplicates='last')
  coo_matrix.tocsr(duplicates='max')
be preferred?  If I understand correctly, you'd want to use
.tocsr(duplicates='last').

Another question is whether we want to put this in the COO->CSR (and
CSC) conversions.  At this point, I think COO->CSR should *always* sum
duplicates together and we should instead provide a separate function
or member function of coo_matrix that provides additional options,
like 'last', 'max', etc.  In general, any binary operator (T,T) -> T
could be used as an accumulator, but we would provide the most common
options.
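
As a rough sketch of what such a separate function might look like,
here is a pure-Python version that merges COO-style triplets with an
arbitrary binary accumulator.  The names combine_duplicates and
keep_last are hypothetical and exist only for this illustration;
nothing like this is in scipy.sparse as far as this thread is
concerned, and a real version would operate on arrays, not dicts.

```python
import operator


def combine_duplicates(rows, cols, vals, accumulate=operator.add):
    """Merge (row, col, value) triplets that share a position.

    `accumulate` is any binary operator (T, T) -> T; it is applied in
    input order to the values found at the same (row, col).
    Hypothetical helper for illustration, not part of scipy.sparse.
    """
    merged = {}
    for r, c, v in zip(rows, cols, vals):
        key = (r, c)
        merged[key] = accumulate(merged[key], v) if key in merged else v
    return merged


def keep_last(old, new):
    """Accumulator for duplicates='last': the newer value replaces the older."""
    return new


# Position (0, 0) appears three times, with values 4, 1 and 2.
rows = [0, 0, 1, 0]
cols = [0, 0, 1, 0]
vals = [4, 1, 7, 2]

summed = combine_duplicates(rows, cols, vals)             # 'sum'
last = combine_duplicates(rows, cols, vals, keep_last)    # 'last'
largest = combine_duplicates(rows, cols, vals, max)       # 'max'

print(summed)   # {(0, 0): 7, (1, 1): 7}
print(last)     # {(0, 0): 2, (1, 1): 7}
print(largest)  # {(0, 0): 4, (1, 1): 7}
```

Keeping this apart from the COO->CSR conversion means the conversion
itself has only one behaviour to document, and the choice of
accumulator is made explicitly by the caller.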

-- 
Nathan Bell wnbell at gmail.com
http://graphics.cs.uiuc.edu/~wnbell/


