[Numpy-discussion] Is there a more efficient way to do this?

Wed Aug 8 10:19:04 EDT 2012

Is there a more efficient way to calculate the "slices" array below?

import numpy
import numpy.random

# In reality, this is between 1 and 50.
DIMENSIONS = 20

# In my real app, I have 100...1M data rows.
ROWS = 1000
DATA = numpy.random.random_integers(0,100,(ROWS,DIMENSIONS))

# This is between 0..DIMENSIONS-1
DRILLBY = 3

# Array of row indices that orders the data by the given dimension.
o = numpy.argsort(DATA[:,DRILLBY])

# Input of my task: the data ordered by the given dimension.
print DATA[o,DRILLBY]

#~ [  0   0   0   0   0   0   0   0   0   0   0   0   0   0   1   1   1   1
#~    1   1   1   1   1   1   1   1   2   2   2   2   2   2   2   2   2   2
#~    2   3   3   3   3   3   3   3   4   4   4   4   4   4   4   4   4   4
#~    4   4   4   4   5   5   5   5   5   5   5   5   5   5   6   6   6   6
#~ .... many more things here
#~   96  96  96  97  97  97  97  97  97  97  97  97  98  98  98  98  98  98
#~   99  99  99  99  99  99  99  99  99  99  99  99  99  99  99  99 100 100
#~  100 100 100 100 100 100 100 100 100 100]

# Output of my task: determine slices for the same values on the
# DRILLBY dimension.

slices = []
prev_val = None
sidx = -1
# Dimension values for the given dimension.
fdv = DATA[:,DRILLBY]

# Go over the rows, sorted by values of the DRILLBY dimension.
for oidx,rowidx in enumerate(o):
    val = fdv[rowidx]
    if val!=prev_val:
        # A new value starts here: close the previous run, if any.
        if prev_val is not None:
            slices.append((prev_val,sidx,oidx))
        sidx = oidx
        prev_val = val

# Close the last run, which extends to the end of the data.
if (sidx>=0) and (sidx<ROWS):
    slices.append((val,sidx,ROWS))
slices = numpy.array(slices,dtype=numpy.int64)

# This is what I want to have!
print slices

#~
#~ [[   0    0   14]
#~  [   1   14   26]
#~  [   2   26   37]
#~  [   3   37   44]
#~  [   4   44   58]
#~ .... many more values here
#~  [  96  952  957]
#~  [  97  957  966]
#~  [  98  966  972]
#~  [  99  972  988]
#~  [ 100  988 1000]]

So, for example, to get all row indices where the dimension value is 
zero: o[0:14]
Or, to get all row indices where the dimension value is 99: o[972:988] 
etc.
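
A minimal sketch of that lookup, using a tiny stand-in column instead of 
the real DATA. The helper name rows_for_value and the sample values are 
hypothetical, made up for illustration only:

```python
import numpy

def rows_for_value(o, slices, value):
    """Return the row indices whose sorted-column value equals `value`.

    `o` is the argsort of the column; `slices` holds (value, start, stop) rows.
    """
    for val, start, stop in slices:
        if val == value:
            return o[start:stop]
    return o[0:0]  # empty result: the value does not occur

# Hypothetical stand-in for DATA[:,DRILLBY].
column = numpy.array([2, 0, 1, 0, 2, 2])
o = numpy.argsort(column)
# Slices written out by hand for this stand-in column.
slices = numpy.array([[0, 0, 2], [1, 2, 3], [2, 3, 6]], dtype=numpy.int64)

rows = rows_for_value(o, slices, 2)  # indices into `column` where it equals 2
```

Only the index array o is sliced here, so no rows of the data are copied.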

I do not want to make copies of DATA, because it can be huge. The 
argsort is fast enough. I just need to create slices for different 
dimensions. The above code works, but it is a linear scan implemented 
in pure Python, so interpreter code runs on every iteration. For 1 
million rows, this is very slow. Is there a way to produce "slices" 
with numpy code? I could write C code for this, but I would prefer to 
do it with vectorized numpy operations.
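
One possible way to do this with array operations, offered as a sketch 
rather than a definitive answer: compare each sorted value with its 
predecessor and treat the positions where they differ as run boundaries. 
The function name compute_slices is hypothetical. It assumes its input is 
already sorted, e.g. DATA[o,DRILLBY], which copies a single column (ROWS 
values) but not DATA itself.

```python
import numpy

def compute_slices(sorted_vals):
    """Build (value, start, stop) rows, one per run of equal values.

    Assumption: `sorted_vals` is a 1-D, already sorted integer array.
    """
    sorted_vals = numpy.asarray(sorted_vals)
    n = len(sorted_vals)
    if n == 0:
        return numpy.empty((0, 3), dtype=numpy.int64)
    # A new run starts wherever a value differs from the one before it.
    change = numpy.flatnonzero(sorted_vals[1:] != sorted_vals[:-1]) + 1
    starts = numpy.concatenate(([0], change))
    stops = numpy.concatenate((change, [n]))
    # Columns: the run's value, its first position, one past its last position.
    result = numpy.column_stack((sorted_vals[starts], starts, stops))
    return result.astype(numpy.int64)
```

With the variables above this would be called as 
slices = compute_slices(DATA[o,DRILLBY]). The work is still linear in the 
number of rows, but the loop runs inside numpy instead of the interpreter.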

Thanks,

   Laszlo