[Numpy-discussion] sparse array data

Thu May 3 04:39:35 EDT 2012

On 05/03/2012 03:25 AM, Travis Oliphant wrote:
>
> On May 2, 2012, at 5:28 PM, Stéfan van der Walt wrote:
>
>> On Wed, May 2, 2012 at 3:20 PM, Francesc Alted<francesc at continuum.io>  wrote:
>>> On 5/2/12 4:07 PM, Stéfan van der Walt wrote:
>>> Well, as the OP said, coo_matrix does not support dimensions larger than
>>> 2, right?
>>
>> That's just an implementation detail, I would imagine--I'm trying to
>> figure out if there is a new principle behind "synthetic dimensions"?
>> By the way, David Cournapeau mentioned using b-trees for sparse ops a
>> while ago; did you ever talk to him about those ideas?
>
> The only new principle (which is not strictly new --- but new to NumPy's world-view) is using one (or more) fields of a structured array as "synthetic dimensions" which replace 1 or more of the raw table dimensions.     Thus, you could create a "view" of a NumPy array (or a group of NumPy arrays) where 1 or more dimensions is replaced with these "sparse dimensions".      This is a fully-general way to handle a mixture of sparse and dense structures in one general array interface.
>
> However, you lose the O(1) lookup as now you must search for the non-zero items in order to implement algorithms (indexes are critical and Francesc has some nice indexes in PyTables).
>
> A group-by operation can be replaced by an operation on "a sparse dimension" where you have mapped attributes to 1 or more dimensions in the underlying array.
>
> coo_matrix is just a special case of this more general idea.    If you add the ability to compress attributes, then you get csr, csc, and various other forms of matrices as well.
>
> More to come....  If you are interested in this sort of thing please let me know....

I am very interested in such developments, since a NumPy axis very often 
doesn't mean anything to itself to me, but I'll do stuff like

arr[l*l + l + m - lmin * lmin] = f(l, m)

Something like a function you could register that maps virtual 
dimensions to actual dimensions would be great:

void lm_to_idx(Py_ssize_t *virtual_indices, Py_ssize_t *indices,
                void *ctx) {
     Py_ssize_t l, m, lmin;
     lmin = ((my_context_t*)ctx)->lmin;
     for (Py_ssize_t i = 0; i != NPY_INDEX_TRANSLATION_BLOCKSIZE; ++i) {
         l = virtual_indices[2 * i];
         m = virtual_indiecs[2 * i + 1];
         indices[i] = l * l + l + m - lmin * lmin;
     }
}

Then register the function with an array (somehow), and then you can do

arr[l, m] = f(l, m)

:-)

But, doing it manually isn't *that* much of a pain, I won't be putting 
in any work in this direction myself.

Dag