Hello,
I like to use record arrays to access fields by their name, and because they are easy to use with pytables. But I think they are not very efficient for what I have to do. Maybe I'm misunderstanding something.
Example :
import numpy as np
age = np.random.randint(0, 99, 10e6)
weight = np.random.randint(0, 200, 10e6)
data = np.rec.fromarrays((age, weight), names='age, weight')
# the kind of operations I do is :
data.age += data.age + 1
# but it's far less efficient than doing :
age += 1
# because I think the record array stores [(age_0, weight_0) ...(age_n, weight_n)]
# and not [age0 ... age_n] then [weight_0 ... weight_n].
So I think I don't use record arrays for the right purpose. I only need something that would make it easy for me to manipulate data by accessing fields by their name.
Am I wrong? Is there something in numpy for my purpose? Do I have to implement my own class, with something like:
class FieldArray:
    def __init__(self, array_dict):
        self.array_list = array_dict

    def __getitem__(self, field):
        return self.array_list[field]

    def __setitem__(self, field, value):
        self.array_list[field] = value

my_arrays = {'age': age, 'weight' : weight}
data = FieldArray(my_arrays)

data['age'] += 1
Thank you for the help,
Jean-Baptiste Rudant
Jean-Baptiste Rudant wrote:
[snip]
Sorry I am not able to answer your question; I am really a new user of numpy also.
It does seem the addition operation is more than 4 times slower, when using record arrays, based on the following:
>>> import numpy, sys, timeit
>>> sys.version
'2.5.2 (r252:60911, Feb 21 2008, 13:11:45) [MSC v.1310 32 bit (Intel)]'
>>> numpy.__version__
'1.2.1'
>>> count = 10e6
>>> ages = numpy.random.randint(0,100,count)
>>> weights = numpy.random.randint(1,200,count)
>>> data = numpy.rec.fromarrays((ages,weights),names='ages,weights')
>>> timer = timeit.Timer('data.ages += 1','from __main__ import data')
>>> timer.timeit(number=100)
30.110649537860262
>>> timer = timeit.Timer('ages += 1','from __main__ import ages')
>>> timer.timeit(number=100)
6.9850710076280507
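For readers on Python 3 with a recent NumPy, the same comparison could be sketched roughly as follows. This is only an illustration under stated assumptions: the array size, the repeat count and the helper names bump_field and bump_plain are made up here and are not part of the session above, and no timings are claimed since they depend on the machine and the NumPy version.

```python
import timeit

import numpy as np

n = 100_000      # hypothetical size, much smaller than the 10e6 used in the thread
repeats = 20     # hypothetical repeat count

rng = np.random.default_rng(0)
ages = rng.integers(0, 100, n)
weights = rng.integers(1, 200, n)
start = ages.copy()  # kept only so the result can be checked afterwards

# fromarrays copies both inputs into one interleaved record array
data = np.rec.fromarrays((ages, weights), names="ages,weights")


def bump_field(rec=data):
    # increment through the record array's attribute access
    rec.ages += 1


def bump_plain(arr=ages):
    # increment the plain, contiguous array in place
    arr += 1


t_rec = timeit.timeit(bump_field, number=repeats)
t_plain = timeit.timeit(bump_plain, number=repeats)
print("record array field:", t_rec)
print("plain array:       ", t_plain)
```

Passing callables to timeit avoids the 'from __main__ import ...' setup string, which only works when the script is run as the main module.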
Numpy-discussion mailing list Numpy-discussion@scipy.org http://projects.scipy.org/mailman/listinfo/numpy-discussion
Jean-Baptiste Rudant wrote:
[snip]
You can accomplish what your FieldArray class does using numpy dtypes:
import numpy as np
dt = np.dtype([('age', np.int32), ('weight', np.int32)])
N = int(10e6)
data = np.empty(N, dtype=dt)
data['age'] = np.random.randint(0, 99, N)
data['weight'] = np.random.randint(0, 200, N)

data['age'] += 1
Timing for recarrays (your code):
In [10]: timeit data.age += 1
10 loops, best of 3: 221 ms per loop
Timing for my example:
In [2]: timeit data['age']+=1
10 loops, best of 3: 150 ms per loop
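A small self-contained variant of the structured-dtype approach, written for Python 3 and a recent NumPy, might look like the sketch below. The size n and the use of np.random.default_rng are assumptions made for illustration; the field names follow the example above.

```python
import numpy as np

# Two 32-bit integer fields, so each record occupies 8 bytes
dt = np.dtype([("age", np.int32), ("weight", np.int32)])

n = 1_000  # hypothetical size, kept small for illustration
rng = np.random.default_rng(0)

data = np.empty(n, dtype=dt)
data["age"] = rng.integers(0, 99, n)       # values are cast to int32 on assignment
data["weight"] = rng.integers(0, 200, n)

before = data["age"].copy()
data["age"] += 1   # item access returns a view, so the add happens in place

print(data.dtype.names, data.itemsize)
```

Because data is an ordinary ndarray with a structured dtype, slicing, boolean masks and the other usual array operations keep working on whole records.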
Hope this helps.
Ryan
Jean-Baptiste,
As you stated, everything depends on what you want to do. If you need to keep the correspondence age<>weight for each entry, then yes, record arrays, or at least flexible-type arrays, are the best. (The difference between a recarray and a flexible-type array is that fields can be accessed by attribute (data.age) or by item (data['age']) with recarrays, but only by item with flexible-type arrays.)
Using your example, you could very well do: data['age'] += 1 and still keep the correspondence age<>weight.
Your FieldArray class returns an object that is not an ndarray, which may have some undesired side-effects.
As Ryan noted, flexible-type arrays are usually faster, because they lack the overhead brought by the possibility of accessing data by attribute. So, if you don't mind using the 'access-by-fields' syntax, you're good to go.
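The distinction between the two kinds of array can be illustrated with a short sketch (Python 3, recent NumPy assumed). The names plain and rec and the sample values are hypothetical; the point is only that viewing a flexible-type array as np.recarray adds attribute access without copying the data.

```python
import numpy as np

dt = np.dtype([("age", np.int64), ("weight", np.int64)])

# Flexible-type (structured) array: fields are reachable by item only
plain = np.zeros(5, dtype=dt)
plain["age"] = [30, 41, 25, 67, 52]

# Viewing it as a recarray adds attribute access on the same memory
rec = plain.view(np.recarray)
rec.age += 1

print(plain["age"])             # the change made through rec is visible here
print(hasattr(plain, "age"))    # the plain array has no such attribute
```

Both objects remain ndarrays, unlike a hand-written container class, so they can be passed to any function that expects an array.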
On Dec 29, 2008, at 10:58 AM, Jean-Baptiste Rudant wrote:
[snip]
On Monday 29 December 2008, Jean-Baptiste Rudant wrote:
[snip]
That's a very good question. What you are observing are the effects of arranging a dataset by records (row-wise) or by columns (column-wise). A record array in numpy arranges data row-wise, so that in your 'data' array the data is placed in memory as follows:
data['age'][0] --> data['weight'][0] --> data['age'][1] --> data['weight'][1] --> ...
while in your 'FieldArray' class, data is arranged by column and is placed in memory as:
data['age'][0] --> data['age'][1] --> ... --> data['weight'][0] --> data['weight'][1] --> ...
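The two layouts can be inspected directly through array strides. The following is a minimal sketch for Python 3 and a recent NumPy, assuming two 8-byte integer fields; the variable names field and column are illustrative only.

```python
import numpy as np

dt = np.dtype([("age", np.int64), ("weight", np.int64)])
data = np.zeros(4, dtype=dt)

# Row-wise layout: the 'age' field is a strided view into the records,
# so consecutive ages sit one full record (16 bytes) apart.
field = data["age"]
print(field.strides, field.flags["C_CONTIGUOUS"])

# Column-wise layout: a separate copy where consecutive ages are adjacent.
column = field.copy()
print(column.strides, column.flags["C_CONTIGUOUS"])

# Byte offset of 'weight' inside each record
print(data.dtype.fields["weight"][1])
```

A stride twice the item size is what the "stride count is 2" remark in this message refers to: only every other 8-byte slot touched belongs to the field being updated.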
The difference between the two approaches is that the row-wise arrangement is more efficient when data is iterated record by record, while the column-wise one is more efficient when data is iterated column by column. This is why you are seeing the 4x increase in performance. Incidentally, by looking at both data arrangements, I'd expect an increase of just 2x (the stride count is 2 in this case), but I suspect that there are hidden copies during the increment operation for the record array case.
So you are perfectly right. In some situations you may want to use a row-wise arrangement (record array) and in other situations a column-wise one. So, it would be handy to have some code to convert back and forth between both data arrangements. Here are a couple of classes for doing this (they are a quick-and-dirty generalization of your code):
class ColArray:
    def __init__(self, recarray):
        dictarray = {}
        if isinstance(recarray, np.ndarray):
            fields = recarray.dtype.fields
        elif isinstance(recarray, RecArray):
            fields = recarray.fields
        else:
            raise TypeError, "Unrecognized input type!"
        for colname in fields:
            # For optimum performance you should 'copy' the column!
            dictarray[colname] = recarray[colname].copy()
        self.dictarray = dictarray

    def __getitem__(self, field):
        return self.dictarray[field]

    def __setitem__(self, field, value):
        self.dictarray[field] = value

    def iteritems(self):
        return self.dictarray.iteritems()

class RecArray:
    def __init__(self, dictarray):
        ldtype = []
        fields = []
        for colname, column in dictarray.iteritems():
            ldtype.append((colname, column.dtype))
            fields.append(colname)
            collen = len(column)
        dt = np.dtype(ldtype)
        recarray = np.empty(collen, dtype=dt)
        for colname, column in dictarray.iteritems():
            recarray[colname] = column
        self.recarray = recarray
        self.fields = fields

    def __getitem__(self, field):
        return self.recarray[field]

    def __setitem__(self, field, value):
        self.recarray[field] = value
So, ColArray takes as its parameter a record array or a RecArray instance, both of which have a row-wise arrangement, and returns an object that is column-wise. RecArray makes the inverse trip on the ColArray that it takes as its parameter.
A small example of use:
from time import time

N = 10e6
age = np.random.randint(0, 99, N)
weight = np.random.randint(0, 200, N)

# Get an initial record array
dt = np.dtype([('age', np.int_), ('weight', np.int_)])
data = np.empty(N, dtype=dt)
data['age'] = age
data['weight'] = weight

t1 = time()
data['age'] += 1
print "time for initial recarray:", round(time()-t1, 3)

data = ColArray(data)
t1 = time()
data['age'] += 1
print "time for ColArray:", round(time()-t1, 3)

data = RecArray(data)
t1 = time()
data['age'] += 1
print "time for reconstructed RecArray:", round(time()-t1, 3)

data = ColArray(data)
t1 = time()
data['age'] += 1
print "time for reconstructed ColArray:", round(time()-t1, 3)
and the output is:
time for initial recarray: 0.298
time for ColArray: 0.076
time for reconstructed RecArray: 0.3
time for reconstructed ColArray: 0.076
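For readers on Python 3, the same round trip can be sketched with two plain functions instead of classes. This is not the code from this message: the function names to_columns and to_records are hypothetical, and the sketch assumes all columns are 1-D arrays of equal length.

```python
import numpy as np


def to_columns(rec):
    """Structured array -> dict of contiguous per-field copies (column-wise)."""
    return {name: rec[name].copy() for name in rec.dtype.names}


def to_records(columns):
    """Dict of equal-length 1-D arrays -> one structured array (row-wise)."""
    dt = np.dtype([(name, col.dtype) for name, col in columns.items()])
    n = len(next(iter(columns.values())))
    rec = np.empty(n, dtype=dt)
    for name, col in columns.items():
        rec[name] = col
    return rec


rec = np.zeros(3, dtype=[("age", np.int64), ("weight", np.int64)])
rec["age"] = [30, 41, 25]

cols = to_columns(rec)   # column-wise copy; rec itself is left untouched
cols["age"] += 1         # operates on a contiguous array
back = to_records(cols)  # back to the row-wise arrangement

print(back["age"], rec["age"])
```

Returning a plain dict keeps each column a real ndarray, at the cost of losing the guarantee that all columns stay the same length.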
So, these classes offer a quick way to go back and forth between both data arrangements, and can be used whenever one representation is found to be more useful. Of course, you must be aware that the conversion takes time, and that it is generally a bad idea to do it just for a single operation. But when you have to perform many operations, a conversion makes a lot of sense.
In fact, my hunch is that the column-wise arrangement is far more useful in general for accelerating operations in heterogeneous arrays, because what people normally do is operate column-wise and not row-wise. If this is actually the case, it would be a good idea to introduce a first-class type in numpy implementing a column-wise heterogeneous array. If this is found to be too cumbersome, perhaps integrating some utilities to do the conversion (similar in spirit to the classes above) would fit the bill.
Cheers,
On Tuesday 30 December 2008, Francesc Alted wrote:
On Monday 29 December 2008, Jean-Baptiste Rudant wrote:
[snip]
The difference between the two approaches is that the row-wise arrangement is more efficient when data is iterated record by record, while the column-wise one is more efficient when data is iterated column by column. This is why you are seeing the 4x increase in performance. Incidentally, by looking at both data arrangements, I'd expect an increase of just 2x (the stride count is 2 in this case), but I suspect that there are hidden copies during the increment operation for the record array case.
As I was mystified by this difference in speed, I kept investigating and I think I have an answer for the difference from the expected speed-up of the in-place increment operator over a recarray field. After looking at the numpy code, it turns out that the following statement:
data.ages += 1
is more or less equivalent to:
a = data.ages
a[:] = a + 1
i.e. a temporary is created (to hold the result of 'a + 1') and then assigned to the 'ages' column. Since memory copies are the bottleneck in this sort of operation, the creation of the temporary introduces a slowdown of 2x (due to the strided column) and the assignment accounts for the additional 2x (4x in total). However, the following idiom:
a = data.ages
a += 1
effectively removes the need for the temporary copy and is 2x faster than the original "data.ages += 1". This can be seen in the next simple benchmark:
---------------------------
import numpy, timeit

count = 10e6
ages = numpy.random.randint(0,100,count)
weights = numpy.random.randint(1,200,count)
data = numpy.rec.fromarrays((ages,weights),names='ages,weights')

timer = timeit.Timer('data.ages += 1','from __main__ import data')
print "v0-->", timer.timeit(number=10)
timer = timeit.Timer('a=data.ages; a[:] = a + 1','from __main__ import data')
print "v1-->", timer.timeit(number=10)
timer = timeit.Timer('a=data.ages; a += 1','from __main__ import data')
print "v2-->", timer.timeit(number=10)
timer = timeit.Timer('ages += 1','from __main__ import ages')
print "v3-->", timer.timeit(number=10)
---------------------------
which produces the next output on my laptop:
v0--> 2.98340201378
v1--> 3.22748112679
v2--> 1.5474319458
v3--> 0.809724807739
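The difference between the in-place idiom and the one that goes through a temporary can be made visible with np.shares_memory. The sketch below assumes Python 3 and a recent NumPy, uses a tiny hypothetical array, and shows only which objects alias the record array; it makes no claim about how current NumPy versions implement 'data.ages += 1' internally.

```python
import numpy as np

dt = np.dtype([("ages", np.int64), ("weights", np.int64)])
data = np.zeros(3, dtype=dt).view(np.recarray)

a = data.ages   # a strided view onto the field, no copy is made
a += 1          # in-place add written straight into the record array

tmp = a + 1     # allocates a new, separate temporary array
a[:] = tmp      # second pass: copy the temporary back into the field

print(np.shares_memory(a, data), np.shares_memory(tmp, data))
print(data.ages)
```

The 'a += 1' form touches the strided field once, while the 'a[:] = a + 1' form reads it, writes a temporary, and then writes the field again.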
As a final comment, I suppose that in-place operators (+=, -=, ...) could be optimized in the context of recarray columns in numpy, but I don't think it is worth the effort: when really high performance is needed for operating with columns in the context of recarrays, a column-wise approach is best.
Cheers,
Thank you for everything, it works fine and it is very helpful.
Regards,
Jean-Baptiste Rudant
________________________________
From: Francesc Alted faltet@pytables.org
To: Discussion of Numerical Python numpy-discussion@scipy.org
Sent: Tuesday, 30 December 2008, 16:34:27
Subject: Re: [Numpy-discussion] Alternative to record array
[snip]