On Monday 29 December 2008, Jean-Baptiste Rudant wrote:
Hello,
I like to use record arrays because they let me access fields by name and because they are easy to use with PyTables. But I think they are not very efficient for what I have to do. Maybe I'm misunderstanding something.
Example:

import numpy as np

age = np.random.randint(0, 99, int(10e6))
weight = np.random.randint(0, 200, int(10e6))
data = np.rec.fromarrays((age, weight), names='age, weight')

# The kind of operation I do is:
data.age += 1
# but it is far less efficient than doing:
age += 1
# because I think the record array stores
# [(age_0, weight_0), ..., (age_n, weight_n)]
# and not [age_0, ..., age_n] then [weight_0, ..., weight_n].
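One way to check that guess (a small sketch using the arrays above; the stride values assume the default integer takes 8 bytes, so they will differ on other platforms):

print age.strides                # (8,): elements sit next to each other
print data.age.strides           # (16,): each age is interleaved with a weight
print data.age.flags.contiguous  # False: the field is a strided view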
So I think I'm not using record arrays for the right purpose. I only need something that makes it easy to manipulate data by accessing fields by their name.
Am I wrong? Is there something in numpy for my purpose? Do I have to implement my own class, with something like:
class FieldArray:
    def __init__(self, array_dict):
        self.array_dict = array_dict

    def __getitem__(self, field):
        return self.array_dict[field]

    def __setitem__(self, field, value):
        self.array_dict[field] = value
my_arrays = {'age': age, 'weight': weight}
data = FieldArray(my_arrays)
data['age'] += 1
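If the data already sits in a record array (for instance one read back from PyTables), the dict could be filled from its fields; 'rec' below is just a placeholder for such a record array:

# Copying each field gives a standalone, contiguous column.
my_arrays = dict((name, rec[name].copy()) for name in rec.dtype.names)
data = FieldArray(my_arrays)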
That's a very good question. What you are observing are the effects of arranging a dataset row-wise (record by record) or column-wise (field by field). A record array in numpy arranges data row-wise, so in your 'data' array the values are placed in memory as follows:

data['age'][0] --> data['weight'][0] --> data['age'][1] --> data['weight'][1] --> ...

while in your 'FieldArray' class the data is arranged column-wise and is placed in memory as:

data['age'][0] --> data['age'][1] --> ... --> data['weight'][0] --> data['weight'][1] --> ...

The difference between the two approaches is that the row-wise arrangement is more efficient when the data is accessed record by record (all the fields of one record at a time), while the column-wise one is more efficient when the data is accessed field by field (a whole column at a time). This is why you are seeing the 4x increase in performance -- incidentally, by looking at both data arrangements I'd expect an increase of just 2x (the stride count is 2 in this case), but I suspect that there are hidden copies during the increment operation in the record array case.

So you are perfectly right: in some situations you may want a row-wise arrangement (a record array) and in others a column-wise one. It would therefore be handy to have some code to convert back and forth between the two arrangements. Here are a couple of classes for doing this (they are a quick-and-dirty generalization of your code):

class ColArray:
    def __init__(self, recarray):
        dictarray = {}
        if isinstance(recarray, np.ndarray):
            fields = recarray.dtype.fields
        elif isinstance(recarray, RecArray):
            fields = recarray.fields
        else:
            raise TypeError("Unrecognized input type!")
        for colname in fields:
            # For optimum performance you should 'copy' the column!
            dictarray[colname] = recarray[colname].copy()
        self.dictarray = dictarray

    def __getitem__(self, field):
        return self.dictarray[field]

    def __setitem__(self, field, value):
        self.dictarray[field] = value

    def iteritems(self):
        return self.dictarray.iteritems()


class RecArray:
    def __init__(self, dictarray):
        ldtype = []
        fields = []
        for colname, column in dictarray.iteritems():
            ldtype.append((colname, column.dtype))
            fields.append(colname)
            collen = len(column)
        dt = np.dtype(ldtype)
        recarray = np.empty(collen, dtype=dt)
        for colname, column in dictarray.iteritems():
            recarray[colname] = column
        self.recarray = recarray
        self.fields = fields

    def __getitem__(self, field):
        return self.recarray[field]

    def __setitem__(self, field, value):
        self.recarray[field] = value

So, ColArray takes as a parameter a record array (or a RecArray instance), both of which have a row-wise arrangement, and returns an object that is arranged column-wise. RecArray does the inverse conversion on the ColArray it takes as a parameter.
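As a quick sanity check that the round trip preserves the data (a sketch; 'rec' stands for any record array such as the 'data' built in the timing example below):

cols = ColArray(rec)   # row-wise  -> column-wise
back = RecArray(cols)  # column-wise -> row-wise again
print np.all(back['age'] == rec['age'])        # True
print np.all(back['weight'] == rec['weight'])  # True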
A small example of use:

from time import time

N = int(10e6)
age = np.random.randint(0, 99, N)
weight = np.random.randint(0, 200, N)

# Get an initial record array
dt = np.dtype([('age', np.int_), ('weight', np.int_)])
data = np.empty(N, dtype=dt)
data['age'] = age
data['weight'] = weight

t1 = time()
data['age'] += 1
print "time for initial recarray:", round(time()-t1, 3)

data = ColArray(data)
t1 = time()
data['age'] += 1
print "time for ColArray:", round(time()-t1, 3)

data = RecArray(data)
t1 = time()
data['age'] += 1
print "time for reconstructed RecArray:", round(time()-t1, 3)

data = ColArray(data)
t1 = time()
data['age'] += 1
print "time for reconstructed ColArray:", round(time()-t1, 3)

and the output is:

time for initial recarray: 0.298
time for ColArray: 0.076
time for reconstructed RecArray: 0.3
time for reconstructed ColArray: 0.076

So these classes offer a quick way to go back and forth between the two data arrangements, and they can be used whenever one representation turns out to be more useful. Of course, you must be aware that the conversion takes time, and it is generally a bad idea to do it just for a single operation. But when you have to operate on the data a lot, a conversion makes a lot of sense.

In fact, my hunch is that the column-wise arrangement is far more useful in general for accelerating operations on heterogeneous arrays, because what people normally do is operate column-wise, not row-wise. If this is actually the case, it would be a good idea to introduce a first-class type in numpy implementing a column-wise heterogeneous array. If that turns out to be too cumbersome, perhaps integrating some utilities to do the conversion (similar in spirit to the classes above) would fit the bill.

Cheers,

--
Francesc Alted