On Monday 29 December 2008, Jean-Baptiste Rudant wrote:
Hello,
I like to use record arrays because they let me access fields by name and because they are easy to use with PyTables. But I think they are not very efficient for what I have to do. Maybe I'm misunderstanding something.
Example:

import numpy as np

age = np.random.randint(0, 99, int(10e6))
weight = np.random.randint(0, 200, int(10e6))
data = np.rec.fromarrays((age, weight), names='age, weight')

# The kind of operation I do is:
data.age += 1
# but it is far less efficient than doing:
age += 1
# because I think the record array stores
# [(age_0, weight_0), ..., (age_n, weight_n)]
# and not [age_0, ..., age_n] then [weight_0, ..., weight_n].
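One way to check that guess (a small sketch using the arrays above; the stride values assume the default integer takes 8 bytes, so they will differ on other platforms):

print age.strides                # (8,): elements sit next to each other
print data.age.strides           # (16,): each age is interleaved with a weight
print data.age.flags.contiguous  # False: the field is a strided view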
So I think I'm not using record arrays for the right purpose. I only need something that makes it easy to manipulate data by accessing fields by their name.
Am I wrong? Is there something in numpy for my purpose? Do I have to implement my own class, with something like:
class FieldArray:
    def __init__(self, array_dict):
        self.array_dict = array_dict

    def __getitem__(self, field):
        return self.array_dict[field]

    def __setitem__(self, field, value):
        self.array_dict[field] = value
my_arrays = {'age': age, 'weight': weight}
data = FieldArray(my_arrays)
data['age'] += 1
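If the data already sits in a record array (for instance one read back from PyTables), the dict could be filled from its fields; 'rec' below is just a placeholder for such a record array:

# Copying each field gives a standalone, contiguous column.
my_arrays = dict((name, rec[name].copy()) for name in rec.dtype.names)
data = FieldArray(my_arrays)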
That's a very good question. What you are observing are the effects of arranging a dataset row-wise (record by record) or column-wise (field by field). A record array in numpy arranges data row-wise, so in your 'data' array the values are placed in memory as follows:

data['age'][0] --> data['weight'][0] --> data['age'][1] --> data['weight'][1] --> ...

while in your 'FieldArray' class the data is arranged column-wise and is placed in memory as:

data['age'][0] --> data['age'][1] --> ... --> data['weight'][0] --> data['weight'][1] --> ...

The difference between the two approaches is that the row-wise arrangement is more efficient when the data is accessed record by record (all the fields of one record at a time), while the column-wise one is more efficient when the data is accessed field by field (a whole column at a time). This is why you are seeing the 4x increase in performance -- incidentally, by looking at both data arrangements I'd expect an increase of just 2x (the stride count is 2 in this case), but I suspect that there are hidden copies during the increment operation in the record array case.

So you are perfectly right: in some situations you may want a row-wise arrangement (a record array) and in others a column-wise one. It would therefore be handy to have some code to convert back and forth between the two arrangements. Here are a couple of classes for doing this (they are a quick-and-dirty generalization of your code):

class ColArray:
    def __init__(self, recarray):
        dictarray = {}
        if isinstance(recarray, np.ndarray):
            fields = recarray.dtype.fields
        elif isinstance(recarray, RecArray):
            fields = recarray.fields
        else:
            raise TypeError("Unrecognized input type!")
        for colname in fields:
            # For optimum performance you should 'copy' the column!
            dictarray[colname] = recarray[colname].copy()
        self.dictarray = dictarray

    def __getitem__(self, field):
        return self.dictarray[field]

    def __setitem__(self, field, value):
        self.dictarray[field] = value

    def iteritems(self):
        return self.dictarray.iteritems()


class RecArray:
    def __init__(self, dictarray):
        ldtype = []
        fields = []
        for colname, column in dictarray.iteritems():
            ldtype.append((colname, column.dtype))
            fields.append(colname)
            collen = len(column)
        dt = np.dtype(ldtype)
        recarray = np.empty(collen, dtype=dt)
        for colname, column in dictarray.iteritems():
            recarray[colname] = column
        self.recarray = recarray
        self.fields = fields

    def __getitem__(self, field):
        return self.recarray[field]

    def __setitem__(self, field, value):
        self.recarray[field] = value

So, ColArray takes as a parameter a record array (or a RecArray instance), both of which have a row-wise arrangement, and returns an object that is arranged column-wise. RecArray does the inverse conversion on the ColArray it takes as a parameter.
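As a quick sanity check that the round trip preserves the data (a sketch; 'rec' stands for any record array such as the 'data' built in the timing example below):

cols = ColArray(rec)   # row-wise  -> column-wise
back = RecArray(cols)  # column-wise -> row-wise again
print np.all(back['age'] == rec['age'])        # True
print np.all(back['weight'] == rec['weight'])  # True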
A small example of use:

from time import time

N = int(10e6)
age = np.random.randint(0, 99, N)
weight = np.random.randint(0, 200, N)

# Get an initial record array
dt = np.dtype([('age', np.int_), ('weight', np.int_)])
data = np.empty(N, dtype=dt)
data['age'] = age
data['weight'] = weight

t1 = time()
data['age'] += 1
print "time for initial recarray:", round(time()-t1, 3)

data = ColArray(data)
t1 = time()
data['age'] += 1
print "time for ColArray:", round(time()-t1, 3)

data = RecArray(data)
t1 = time()
data['age'] += 1
print "time for reconstructed RecArray:", round(time()-t1, 3)

data = ColArray(data)
t1 = time()
data['age'] += 1
print "time for reconstructed ColArray:", round(time()-t1, 3)

and the output is:

time for initial recarray: 0.298
time for ColArray: 0.076
time for reconstructed RecArray: 0.3
time for reconstructed ColArray: 0.076

So these classes offer a quick way to go back and forth between the two data arrangements, and they can be used whenever one representation turns out to be more useful. Of course, you must be aware that the conversion takes time, and it is generally a bad idea to do it just for a single operation. But when you have to operate on the data a lot, a conversion makes a lot of sense.

In fact, my hunch is that the column-wise arrangement is far more useful in general for accelerating operations on heterogeneous arrays, because what people normally do is operate column-wise, not row-wise. If this is actually the case, it would be a good idea to introduce a first-class type in numpy implementing a column-wise heterogeneous array. If that turns out to be too cumbersome, perhaps integrating some utilities to do the conversion (similar in spirit to the classes above) would fit the bill.

Cheers,

--
Francesc Alted