Determine slices in a sorted array - NumPy-Discussion

1 Jul 2010

Given a two-dimensional array sorted by a column 'SLICE_BY', how can I
extract the slice indexes of the rows that share the same 'SLICE_BY' value?

Here is an example program, demonstrating the problem:

import numpy as np

a = np.random.randint(0, 100, (20, 4))
SLICE_BY = 0  # Make slices of array 'a' by column SLICE_BY

# Sort along axis SLICE_BY. With SLICE_BY = 0 this sorts every column
# independently, which leaves column 0 sorted as the demo requires.
a.sort(SLICE_BY)

slices = []      # (value, start, stop) for each run of equal values
prev_val = None  # value of the run currently being scanned
sidx = -1        # row index where that run starts
for rowidx, row in enumerate(a):
    val = int(row[SLICE_BY])
    if val != prev_val:
        # A new run starts here; close the previous one, if any.
        if prev_val is not None:
            slices.append((prev_val, sidx, rowidx))
        sidx = rowidx
        prev_val = val

# Close the last run, even when it is a single row.
if prev_val is not None:
    slices.append((prev_val, sidx, a.shape[0]))

print(a)
print(slices)

This program would print:

[[ 1  0  8  1]
 [ 4  5 17  9]
 [ 4 11 19 23]
 [11 12 24 23]
 [13 16 28 23]
 [14 26 29 36]
 [15 33 32 37]
 [20 38 38 40]
 [28 47 47 45]
 [33 50 50 57]
 [45 55 52 65]
 [47 67 60 65]
 [56 76 71 68]
 [61 76 71 78]
 [70 83 82 83]
 [89 83 84 85]
 [91 84 85 87]
 [95 96 86 88]
 [98 96 89 88]
 [99 98 92 88]]
[(1, 0, 1), (4, 1, 3), (11, 3, 4), (13, 4, 5), (14, 5, 6), (15, 6, 7),
(20, 7, 8), (28, 8, 9), (33, 9, 10), (45, 10, 11), (47, 11, 12),
(56, 12, 13), (61, 13, 14), (70, 14, 15), (89, 15, 16), (91, 16, 17),
(95, 17, 18), (98, 18, 19), (99, 19, 20)]

Although my demonstration program is functionally correct, it is not
efficient. I need to do this with 10 million rows. The number of slices is
relatively small (10 to 10000).

Is it possible to construct my "slices" with pure numpy functions? That is,
with anything that does not involve a large number of Python bytecode
instructions, constructing Python objects, or referencing/dereferencing
10 million times.
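
To make the goal concrete, here is a minimal sketch of one possible vectorized approach. It is not a definitive answer. It assumes the array is already sorted by the chosen column, and the helper name find_slices is hypothetical. The idea is to compare each key with its predecessor, so that the Python-level loop runs only once per slice rather than once per row.

```python
import numpy as np

def find_slices(a, col):
    """Return (value, start, stop) for each run of equal values in a[:, col].

    Hypothetical helper; assumes `a` is 2-D and already sorted by column `col`.
    """
    keys = a[:, col]
    if keys.size == 0:
        return []
    # Row indexes where the key differs from the previous row; each starts a run.
    change = np.flatnonzero(keys[1:] != keys[:-1]) + 1
    starts = np.concatenate(([0], change))
    stops = np.concatenate((change, [keys.size]))
    # Only one tuple per slice is built, not one Python object per row.
    return list(zip(keys[starts].tolist(), starts.tolist(), stops.tolist()))

a = np.array([[1, 0], [4, 5], [4, 11], [11, 12]])
slices = find_slices(a, 0)
print(slices)  # [(1, 0, 1), (4, 1, 3), (11, 3, 4)]
```

The start indexes could probably also be obtained with np.unique(keys, return_index=True), at the cost of a sort that the already-sorted input does not need.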

Thanks,

   Laszlo

Laszlo Nagy