![](https://secure.gravatar.com/avatar/ad13088a623822caf74e635a68a55eae.jpg?s=120&d=mm&r=g)
On Tue, Nov 29, 2011 at 2:50 PM, <josef.pktd@gmail.com> wrote:
On Tue, Nov 29, 2011 at 2:07 PM, Nathaniel Smith <njs@pobox.com> wrote:
On Tue, Nov 29, 2011 at 7:14 AM, <josef.pktd@gmail.com> wrote:
Is there a simple or fast way to create a sparse indicator array, `a` below, without going through the dense matrix first?
The standard way is to use the LIL or DOK sparse formats. If you want to use them then you'll have to do your construction "by hand", though -- you can't do the nice broadcasting tricks you're using below. Alternatively, constructing CSC or CSR format directly is not that hard, though it may take some time to wrap your head around the definitions...
from scipy import sparse g = np.array([0, 0, 1, 1]) #categories, integers, u = np.arange(2) #unique's, range(number_categories)
If 'u' is *always* going to be np.arange(number_categories), then actually this is quite trivial (untested code):
data = np.ones(len(g), dtype=np.int8) indices = g indptr = np.arange(len(g)) a = np.csr_matrix((data, indices, indptr))
This works nicely (only "sparse" namespace)
small correction: indptr needs to be 1 longer or it drops the last row (since I didn't RTFM, I don't know what it means, but it works)
g = np.array([0, 0, 1, 2, 1, 1, 2, 0]) data = np.ones(len(g), dtype=np.int8) indptr = np.arange(len(g)+1) #add 1 a = sparse.csr_matrix((data, g, indptr)) a.todense() matrix([[1, 0, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1], [0, 1, 0], [0, 1, 0], [0, 0, 1], [1, 0, 0]], dtype=int8) np.all(a.todense() == (g[:,None] == np.arange(3)).astype(int)) True
Josef
u = np.arange(number_categories) will be a code requirement (group or period labels are consecutive ints)
This gives you a CSR matrix, which you can either use as is or convert to CSC.
If you want to build CSC directly, and want to support an arbitrary 'u' vector, then you could do something like (untested code):
data = np.ones(len(g), dtype=np.int8) indices = np.empty(len(g), dtype=int) write_offset = 0 indptr = np.empty(number_categories, dtype=int) for col_i, category in enumerate(u): indptr[col_i] = write_offset rows = (data == category).nonzero()[0] indices[write_offset:write_offset + len(rows)] = rows write_offset += len(rows)
I still need to check this.
Or you could just use a loop that fills in an LIL matrix :-)
I'm playing with panel data or general error component models. The main point of using sparse is to have a compact, non-loop version.
In some previous attempts at sparse the cost of constructing the array with loops removed much of the advantage of using them, and I could just loop in the algorithm directly.
Thanks,
Josef
-- Nathaniel _______________________________________________ SciPy-User mailing list SciPy-User@scipy.org http://mail.scipy.org/mailman/listinfo/scipy-user