return unique combinations of stacked arrays - slow

I'm trying to create an output array of integers where each value represents a unique combination of values from (1..n) input arrays. As a simple example, given these three arrays:

    a = np.array([0, 1, 2, 3, 0, 1, 2, 3])
    b = np.array([0, 1, 0, 1, 0, 1, 0, 1])
    c = np.array([0, 1, 1, 0, 0, 1, 0, 1])

I want an output array that holds 'codes' for the unique combinations and a dictionary that holds the unique combinations as keys and codes as values.

    out = np.array([0, 1, 2, 3, 0, 1, 4, 5])
    out_dict = {
        (0, 0, 0): 0,
        (1, 1, 1): 1,
        (2, 0, 1): 2,
        (3, 1, 0): 3,
        (2, 0, 0): 4,
        (3, 1, 1): 5,
    }

An additional constraint is that I'm bringing in the (a, b, c) arrays a chunk at a time due to memory limits (i.e., very large rasters), so I need to retain the mapping between chunks.

My current (very naive and pretty slow) implementation in loop form is:

    out_dict = {}
    out = np.zeros_like(a)
    count = 0
    stack = np.vstack((a, b, c)).T
    for (i, arr) in enumerate(stack):
        t = tuple(arr)
        if t not in out_dict:
            out_dict[t] = count
            count += 1
        out[i] = out_dict[t]

Thanks for help,
matt

On Tue, Oct 21, 2014 at 4:18 PM, Matt Gregory <matt.gregory@oregonstate.edu> wrote:
> I'm trying to create an output array of integers where each value
> represents a unique combination of values from (1..n) input arrays.
> As a simple example, given these three arrays:
>
>     a = np.array([0, 1, 2, 3, 0, 1, 2, 3])
>     b = np.array([0, 1, 0, 1, 0, 1, 0, 1])
>     c = np.array([0, 1, 1, 0, 0, 1, 0, 1])
>
> I want an output array that holds 'codes' for the unique combinations
> and a dictionary that holds the unique combinations as keys and codes
> as values.
>
>     out = np.array([0, 1, 2, 3, 0, 1, 4, 5])
>     out_dict = {
>         (0, 0, 0): 0,
>         (1, 1, 1): 1,
>         (2, 0, 1): 2,
>         (3, 1, 0): 3,
>         (2, 0, 0): 4,
>         (3, 1, 1): 5,
>     }
>
> An additional constraint is that I'm bringing in the (a, b, c) arrays
> a chunk at a time due to memory limits (i.e., very large rasters), so
> I need to retain the mapping between chunks.
>
> My current (very naive and pretty slow) implementation in loop form is:
>
>     out_dict = {}
>     out = np.zeros_like(a)
>     count = 0
>     stack = np.vstack((a, b, c)).T
>     for (i, arr) in enumerate(stack):
>         t = tuple(arr)
>         if t not in out_dict:
>             out_dict[t] = count
>             count += 1
>         out[i] = out_dict[t]
>
> Thanks for help,
> matt

See http://stackoverflow.com/questions/23268605/grouping-indices-of-unique-eleme... for some ideas. The main difference is that you can't fit everything in memory, but if there are lots of duplicates you should be able to do it in batches, then combine the batches and repeat.
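
One way the batch idea could look, as a rough sketch rather than the approach from the linked answer: reduce each chunk to its unique rows with `np.unique(..., axis=0)`, and only loop in Python over those unique rows while updating the shared dictionary. The helper name `encode_chunk` is made up for illustration, and the `axis` argument assumes a reasonably recent NumPy (1.13 or later).

```python
import numpy as np

def encode_chunk(arrays, out_dict):
    """Return integer codes for one chunk; out_dict is updated in place.

    `encode_chunk` is a hypothetical helper, not a NumPy function.
    """
    stack = np.column_stack(arrays)  # one row per element, one column per array
    # Unique rows of this chunk, the index where each first occurs, and
    # for every row the position of its unique row.
    uniq, first_idx, inverse = np.unique(
        stack, axis=0, return_index=True, return_inverse=True
    )
    inverse = np.ravel(inverse)  # some NumPy versions return a 2-D inverse
    codes = np.empty(len(uniq), dtype=np.int64)
    # Visit unique rows in order of first appearance so that new codes are
    # numbered the same way as in the original loop.
    for j in np.argsort(first_idx):
        key = tuple(uniq[j].tolist())
        codes[j] = out_dict.setdefault(key, len(out_dict))
    return codes[inverse]

out_dict = {}
a = np.array([0, 1, 2, 3, 0, 1, 2, 3])
b = np.array([0, 1, 0, 1, 0, 1, 0, 1])
c = np.array([0, 1, 1, 0, 0, 1, 0, 1])
out = encode_chunk((a, b, c), out_dict)

# A later chunk reuses the same dictionary, so codes stay consistent.
out2 = encode_chunk(
    (np.array([3, 2]), np.array([1, 1]), np.array([1, 1])), out_dict
)
```

The Python-level loop now runs once per unique combination in a chunk instead of once per element, which should help most when there are many duplicates.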

Another possibility, if the elements are bounded, is to treat them as digits in some number system and evaluate that number, i.e., take the dot product with something like array([1, 10, 100, ...]).

Chuck
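
The "digits in a number system" suggestion might be sketched as follows. This assumes every array holds non-negative integers below a known bound (here 4, 2 and 2 for `a`, `b`, `c`, chosen to fit the example data); `pack_rows` and `bases` are illustrative names, not part of NumPy.

```python
import numpy as np

def pack_rows(arrays, bases):
    """Combine bounded integer arrays into one integer key per element.

    Assumes 0 <= arrays[i] < bases[i]; `pack_rows` is a hypothetical helper.
    """
    stack = np.column_stack(arrays).astype(np.int64)
    bases = np.asarray(bases, dtype=np.int64)
    if (stack < 0).any() or (stack >= bases).any():
        raise ValueError("values fall outside the assumed bounds")
    # weights[i] is the product of bases[i+1:], so the last column is the
    # "ones" digit; for bases (4, 2, 2) this gives [4, 2, 1].
    weights = np.append(np.cumprod(bases[::-1])[::-1][1:], 1)
    return stack @ weights

a = np.array([0, 1, 2, 3, 0, 1, 2, 3])
b = np.array([0, 1, 0, 1, 0, 1, 0, 1])
c = np.array([0, 1, 1, 0, 0, 1, 0, 1])
keys = pack_rows((a, b, c), bases=(4, 2, 2))
# keys is [0, 7, 9, 14, 0, 7, 8, 15]
```

These keys are identical from chunk to chunk without any shared dictionary, but they are not the dense 0, 1, 2, ... codes from the original post; `np.unravel_index(keys, (4, 2, 2))` recovers the original combinations. NumPy's `np.ravel_multi_index((a, b, c), (4, 2, 2))` computes the same keys.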

participants (2): Charles R Harris, Matt Gregory