Mailman 3 creating sparse indicator arrays - SciPy-User

creating sparse indicator arrays

josef.pktd＠gmail.com

Nov. 29, 2011

3:14 p.m.

Is there a simple or fast way to create a sparse indicator array, `a` below, without going through the dense matrix first?

...

...
...
from scipy import sparse g = np.array([0, 0, 1, 1]) #categories, integers, u = np.arange(2) #unique's, range(number_categories)

...

...
...
g[:,None] == u array([[ True, False], [ True, False], [False, True], [False, True]], dtype=bool)

this is the one I want:

...

...
...
a = sparse.csc_matrix((g[:,None] == u)) a <4x2 sparse matrix of type '<type 'numpy.int8'>' with 4 stored elements in Compressed Sparse Column format> a.todense() matrix([[1, 0], [1, 0], [0, 1], [0, 1]], dtype=int8)

Thanks, Josef

Show replies by date

Nathaniel Smith

November 2011

7:07 p.m.

On Tue, Nov 29, 2011 at 7:14 AM, <josef.pktd@gmail.com> wrote:

...

Is there a simple or fast way to create a sparse indicator array, `a` below, without going through the dense matrix first?

The standard way is to use the LIL or DOK sparse formats. If you want to use them then you'll have to do your construction "by hand", though -- you can't do the nice broadcasting tricks you're using below. Alternatively, constructing CSC or CSR format directly is not that hard, though it may take some time to wrap your head around the definitions...

...

...
...
...
from scipy import sparse g = np.array([0, 0, 1, 1]) #categories, integers, u = np.arange(2) #unique's, range(number_categories)

If 'u' is *always* going to be np.arange(number_categories), then actually this is quite trivial (untested code): data = np.ones(len(g), dtype=np.int8) indices = g indptr = np.arange(len(g)) a = np.csr_matrix((data, indices, indptr)) This gives you a CSR matrix, which you can either use as is or convert to CSC. If you want to build CSC directly, and want to support an arbitrary 'u' vector, then you could do something like (untested code): data = np.ones(len(g), dtype=np.int8) indices = np.empty(len(g), dtype=int) write_offset = 0 indptr = np.empty(number_categories, dtype=int) for col_i, category in enumerate(u): indptr[col_i] = write_offset rows = (data == category).nonzero()[0] indices[write_offset:write_offset + len(rows)] = rows write_offset += len(rows) Or you could just use a loop that fills in an LIL matrix :-) -- Nathaniel

josef.pktd＠gmail.com

7:50 p.m.

On Tue, Nov 29, 2011 at 2:07 PM, Nathaniel Smith <njs@pobox.com> wrote:

...

On Tue, Nov 29, 2011 at 7:14 AM, <josef.pktd@gmail.com> wrote:

...
Is there a simple or fast way to create a sparse indicator array, `a` below, without going through the dense matrix first?

The standard way is to use the LIL or DOK sparse formats. If you want to use them then you'll have to do your construction "by hand", though -- you can't do the nice broadcasting tricks you're using below. Alternatively, constructing CSC or CSR format directly is not that hard, though it may take some time to wrap your head around the definitions...

...
...
...
...
from scipy import sparse g = np.array([0, 0, 1, 1]) #categories, integers, u = np.arange(2) #unique's, range(number_categories)

If 'u' is *always* going to be np.arange(number_categories), then actually this is quite trivial (untested code):

data = np.ones(len(g), dtype=np.int8) indices = g indptr = np.arange(len(g)) a = np.csr_matrix((data, indices, indptr))

This works nicely (only "sparse" namespace) u = np.arange(number_categories) will be a code requirement (group or period labels are consecutive ints)

...

This gives you a CSR matrix, which you can either use as is or convert to CSC.

If you want to build CSC directly, and want to support an arbitrary 'u' vector, then you could do something like (untested code):

data = np.ones(len(g), dtype=np.int8) indices = np.empty(len(g), dtype=int) write_offset = 0 indptr = np.empty(number_categories, dtype=int) for col_i, category in enumerate(u): indptr[col_i] = write_offset rows = (data == category).nonzero()[0] indices[write_offset:write_offset + len(rows)] = rows write_offset += len(rows)

I still need to check this.

...

Or you could just use a loop that fills in an LIL matrix :-)

I'm playing with panel data or general error component models. The main point of using sparse is to have a compact, non-loop version. In some previous attempts at sparse the cost of constructing the array with loops removed much of the advantage of using them, and I could just loop in the algorithm directly. Thanks, Josef

...

-- Nathaniel _______________________________________________ SciPy-User mailing list SciPy-User@scipy.org http://mail.scipy.org/mailman/listinfo/scipy-user

josef.pktd＠gmail.com

8:25 p.m.

On Tue, Nov 29, 2011 at 2:50 PM, <josef.pktd@gmail.com> wrote:

...

On Tue, Nov 29, 2011 at 2:07 PM, Nathaniel Smith <njs@pobox.com> wrote:

...
On Tue, Nov 29, 2011 at 7:14 AM, <josef.pktd@gmail.com> wrote:

...
Is there a simple or fast way to create a sparse indicator array, `a` below, without going through the dense matrix first?

The standard way is to use the LIL or DOK sparse formats. If you want to use them then you'll have to do your construction "by hand", though -- you can't do the nice broadcasting tricks you're using below. Alternatively, constructing CSC or CSR format directly is not that hard, though it may take some time to wrap your head around the definitions...

...
...
...
...
from scipy import sparse g = np.array([0, 0, 1, 1]) #categories, integers, u = np.arange(2) #unique's, range(number_categories)

If 'u' is *always* going to be np.arange(number_categories), then actually this is quite trivial (untested code):

data = np.ones(len(g), dtype=np.int8) indices = g indptr = np.arange(len(g)) a = np.csr_matrix((data, indices, indptr))

This works nicely (only "sparse" namespace)

small correction: indptr needs to be 1 longer or it drops the last row (since I didn't RTFM, I don't know what it means, but it works)

...

...
...
g = np.array([0, 0, 1, 2, 1, 1, 2, 0]) data = np.ones(len(g), dtype=np.int8) indptr = np.arange(len(g)+1) #add 1 a = sparse.csr_matrix((data, g, indptr)) a.todense() matrix([[1, 0, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1], [0, 1, 0], [0, 1, 0], [0, 0, 1], [1, 0, 0]], dtype=int8) np.all(a.todense() == (g[:,None] == np.arange(3)).astype(int)) True

Josef

...

u = np.arange(number_categories) will be a code requirement (group or period labels are consecutive ints)

...
This gives you a CSR matrix, which you can either use as is or convert to CSC.

If you want to build CSC directly, and want to support an arbitrary 'u' vector, then you could do something like (untested code):

data = np.ones(len(g), dtype=np.int8) indices = np.empty(len(g), dtype=int) write_offset = 0 indptr = np.empty(number_categories, dtype=int) for col_i, category in enumerate(u): indptr[col_i] = write_offset rows = (data == category).nonzero()[0] indices[write_offset:write_offset + len(rows)] = rows write_offset += len(rows)

I still need to check this.

...
Or you could just use a loop that fills in an LIL matrix :-)

I'm playing with panel data or general error component models. The main point of using sparse is to have a compact, non-loop version.

In some previous attempts at sparse the cost of constructing the array with loops removed much of the advantage of using them, and I could just loop in the algorithm directly.

Thanks,

Josef

...
-- Nathaniel _______________________________________________ SciPy-User mailing list SciPy-User@scipy.org http://mail.scipy.org/mailman/listinfo/scipy-user

josef.pktd＠gmail.com

December 2011

2:04 a.m.

On Tue, Nov 29, 2011 at 3:25 PM, <josef.pktd@gmail.com> wrote:

...

On Tue, Nov 29, 2011 at 2:50 PM, <josef.pktd@gmail.com> wrote:

...
On Tue, Nov 29, 2011 at 2:07 PM, Nathaniel Smith <njs@pobox.com> wrote:

...
On Tue, Nov 29, 2011 at 7:14 AM, <josef.pktd@gmail.com> wrote:

...
Is there a simple or fast way to create a sparse indicator array, `a` below, without going through the dense matrix first?

The standard way is to use the LIL or DOK sparse formats. If you want to use them then you'll have to do your construction "by hand", though -- you can't do the nice broadcasting tricks you're using below. Alternatively, constructing CSC or CSR format directly is not that hard, though it may take some time to wrap your head around the definitions...

...
...
...
> from scipy import sparse > g = np.array([0, 0, 1, 1]) #categories, integers, > u = np.arange(2) #unique's, range(number_categories)

If 'u' is *always* going to be np.arange(number_categories), then actually this is quite trivial (untested code):

data = np.ones(len(g), dtype=np.int8) indices = g indptr = np.arange(len(g)) a = np.csr_matrix((data, indices, indptr))

This works nicely (only "sparse" namespace)

small correction: indptr needs to be 1 longer or it drops the last row (since I didn't RTFM, I don't know what it means, but it works)

...
...
...
g = np.array([0, 0, 1, 2, 1, 1, 2, 0]) data = np.ones(len(g), dtype=np.int8) indptr = np.arange(len(g)+1) #add 1 a = sparse.csr_matrix((data, g, indptr)) a.todense() matrix([[1, 0, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1], [0, 1, 0], [0, 1, 0], [0, 0, 1], [1, 0, 0]], dtype=int8) np.all(a.todense() == (g[:,None] == np.arange(3)).astype(int)) True

Josef

...
u = np.arange(number_categories) will be a code requirement (group or period labels are consecutive ints)

...
This gives you a CSR matrix, which you can either use as is or convert to CSC.

If you want to build CSC directly, and want to support an arbitrary 'u' vector, then you could do something like (untested code):

data = np.ones(len(g), dtype=np.int8) indices = np.empty(len(g), dtype=int) write_offset = 0 indptr = np.empty(number_categories, dtype=int) for col_i, category in enumerate(u): indptr[col_i] = write_offset rows = (data == category).nonzero()[0] indices[write_offset:write_offset + len(rows)] = rows write_offset += len(rows)

I still need to check this.

...
Or you could just use a loop that fills in an LIL matrix :-)

I'm playing with panel data or general error component models. The main point of using sparse is to have a compact, non-loop version.

In some previous attempts at sparse the cost of constructing the array with loops removed much of the advantage of using them, and I could just loop in the algorithm directly.

finally got started with some timing groupsums: 200,000 observations, 20 variables, 500 groups sparse looks good, even better than bincount In [69]: %timeit a = sparse.csr_matrix((data, g, np.arange(len(g)+1))) 1000 loops, best of 3: 732 us per loop In [64]: a Out[64]: <200000x500 sparse matrix of type '<type 'numpy.int8'>' with 200000 stored elements in Compressed Sparse Row format> In [65]: %timeit x.T * a #sparse 100 loops, best of 3: 6.28 ms per loop In [66]: %timeit np.array([np.bincount(g, weights=x[:,col]) for col in range(x.shape[1])]) 10 loops, best of 3: 57.5 ms per loop In [67]: %timeit np.dot(xT, indi) 1 loops, best of 3: 1.29 s per loop In [72]: %timeit for cat in u: x[g==cat].sum(0) 1 loops, best of 3: 635 ms per loop In [68]: indi.shape Out[68]: (200000, 500) In [70]: x.shape Out[70]: (200000, 20) In [73]: res_dot = np.dot(xT, indi) In [74]: res_sparse = x.T * a In [75]: np.max(np.abs(res_dot - res_sparse)) Out[75]: 0.0 nice Josef

...

...
Thanks,

Josef

...
-- Nathaniel _______________________________________________ SciPy-User mailing list SciPy-User@scipy.org http://mail.scipy.org/mailman/listinfo/scipy-user

4824

Age (days ago)

4826

Last active (days ago)

List overview

Download

4 comments

2 participants

participants (2)

josef.pktd＠gmail.com
Nathaniel Smith

creating sparse indicator arrays

josef.pktd＠gmail.com

Nathaniel Smith

josef.pktd＠gmail.com

josef.pktd＠gmail.com

josef.pktd＠gmail.com

tags

participants (2)