Can I add rows and columns to recarray?
I'm fairly new to numpy and I'm trying to figure out the right way to do things. This continues my earlier question about using a recarray as a relation. I have a recarray like this:

In [339]: arr = np.array([
   .....:     (1, 2.2, 0.0),
   .....:     (3, 4.5, 0.0)
   .....:     ],
   .....:     dtype=[
   .....:         ('unit',int),
   .....:         ('price',float),
   .....:         ('amount',float),
   .....:     ]
   .....: )

In [340]: data = arr.view(recarray)

One of the most common things I want to do is append rows to data. I think concatenate() might be the method, but I get a problem:

In [342]: np.concatenate((data0,[1,9.0,9.0]))
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)

c:\Python26\Lib\site-packages\numpy\<ipython console> in <module>()

TypeError: expected a readable buffer object

The other thing I want to do is calculate column values. Right now it can do great things like

In [343]: data.amount = data.unit * data.price

But sometimes I need to add a new column that does not already exist, e.g.:

In [344]: data.discount_price = data.price * 0.9

How can I add a new column? I tried column_stack, but it gives a similar TypeError. I figure I need to first specify the type of the column, but I don't know how.

Thanks,
Wai Yip
On Sun, Dec 5, 2010 at 10:56 PM, Wai Yip Tung <tungwaiyip@yahoo.com> wrote:
Check out numpy.lib.recfunctions

I often have

    import numpy.lib.recfunctions as nprf

Skipper
Thank you for the quick response and Christopher's explanation on the design background.

All my tables fit in memory. I want to explore the data interactively, and a relational database does not provide me a lot of value.

I was rolling my own library before I came to numpy. Then I found numpy's universal functions awesome and a really good fit for what I want to do. Now I just need to find out how to add a row, which is easy in Python. It is OK if it rebuilds the array when I add a column, which should happen infrequently. But if adding a row builds a new array, this will lead to O(n^2) complexity. In any case, I will explore the recfunctions.

Thank you
Wai Yip
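One common way around the quadratic cost described above is to collect rows as tuples in a plain Python list and convert to a structured array once at the end. This is a sketch under assumptions and was not proposed in the thread itself; the field names follow the original example and the row values are made up.

```python
import numpy as np

# Same structured dtype as in the original example.
dt = np.dtype([('unit', int), ('price', float), ('amount', float)])

# Accumulate rows as plain tuples; list.append is amortized O(1).
rows = []
for unit, price in [(1, 2.2), (3, 4.5), (2, 1.2)]:  # illustrative values
    rows.append((unit, price, 0.0))

# Build the structured array a single time, then view it as a recarray.
data = np.array(rows, dtype=dt).view(np.recarray)

# Vectorized column arithmetic works as in the original post.
data.amount = data.unit * data.price
```

The conversion copies the data once, so the total cost stays linear in the number of rows, at the price of holding the rows as Python objects until the conversion.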
On 12/6/10 1:00 PM, Wai Yip Tung wrote:
Thank you for the quick response and Christopher's explanation on the design background.
you're welcome.
But if adding a row builds a new array, this will lead to O(n^2) complexity.
if you are adding a lot of rows one at a time, yes, you can have performance issues -- though re-allocating data is pretty fast, too -- maybe it won't matter.

If it does, consider the accumulator code I sent, or use it as inspiration to write your own. If you do improve it, please send your improvements back to me.

-Chris

--
Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115       (206) 526-6317   main reception

Chris.Barker@noaa.gov
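The accumulator class referred to here is not reproduced in the archive. Purely as an illustration of the idea, the sketch below uses a hypothetical `RowAccumulator` class (the name, its methods and the doubling growth factor are all assumptions, not Chris's code). It over-allocates a structured buffer and doubles it when full, so that appends are amortized O(1).

```python
import numpy as np


class RowAccumulator:
    """Hypothetical growable buffer for structured rows (not the code from the thread)."""

    def __init__(self, dtype, capacity=16):
        self._buf = np.zeros(capacity, dtype=dtype)  # over-allocated storage
        self._n = 0                                  # number of rows in use

    def append(self, row):
        if self._n == len(self._buf):
            # Buffer full: allocate twice the space and copy the old rows once.
            bigger = np.zeros(2 * len(self._buf), dtype=self._buf.dtype)
            bigger[:self._n] = self._buf
            self._buf = bigger
        self._buf[self._n] = row  # row must be a tuple matching the dtype
        self._n += 1

    def __len__(self):
        return self._n

    def to_array(self):
        # Return a copy trimmed to the rows actually appended.
        return self._buf[:self._n].copy()


dt = np.dtype([('unit', int), ('price', float), ('amount', float)])
acc = RowAccumulator(dt, capacity=2)
for row in [(1, 2.2, 0.0), (3, 4.5, 0.0), (2, 1.2, 0.0)]:
    acc.append(row)
result = acc.to_array()
```

Doubling means each row is copied only a constant number of times on average, which is the same trade-off Python lists make internally.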
On Monday 06 December 2010 22:00:29 Wai Yip Tung wrote:
Thank you for the quick response and Christopher's explanation on the design background.
All my tables fit in memory. I want to explore the data interactively, and a relational database does not provide me a lot of value.
I was rolling my own library before I came to numpy. Then I found numpy's universal functions awesome and a really good fit for what I want to do. Now I just need to find out how to add a row, which is easy in Python. It is OK if it rebuilds the array when I add a column, which should happen infrequently. But if adding a row builds a new array, this will lead to O(n^2) complexity. In any case, I will explore the recfunctions.
If you want a container with better complexity for adding columns than O(n^2), you may want to have a look at the ctable object in the carray package:

https://github.com/FrancescAlted/carray

carray is about providing compressed, in-memory data containers for both homogeneous data (arrays) and heterogeneous data (structured arrays). Here is an example of its use:
    import numpy as np
    import carray as ca
    NR = 1000*1000
    r = np.fromiter(((i,i*i) for i in xrange(NR)), dtype="i4,i8")
    new_field = np.arange(NR, dtype='f8')**3
    rc = ca.ctable(r)
    rc
    ctable((1000000,), [('f0', '<i4'), ('f1', '<i8')])
      nbytes: 11.44 MB; cbytes: 1.71 MB; ratio: 6.70
    [(0, 0), (1, 1), (2, 4), ..., (999997, 999994000009), (999998, 999996000004), (999999, 999998000001)]
    time rc.addcol(new_field, "f2")
    CPU times: user 0.03 s, sys: 0.00 s, total: 0.03 s
    Wall time: 0.03 s
that is, only 30 ms for appending a column. This is basically the time to copy (and compress) the data (i.e. O(n)). If you append an already compressed column, the cost of adding it is O(1):
    r = np.fromiter(((i,i*i) for i in xrange(NR)), dtype="i4,i8")
    rc = ca.ctable(r)
    cnew_field = ca.carray(np.arange(NR, dtype='f8')**3)
    time rc.addcol(cnew_field, "f2")
    CPU times: user 0.00 s, sys: 0.00 s, total: 0.00 s
    Wall time: 0.00 s
On the other hand, using plain structured arrays is considerably more costly:
    import numpy.lib.recfunctions as nprf
    time r2 = nprf.rec_append_fields(r, 'f2', new_field, 'f8')
    CPU times: user 0.34 s, sys: 0.02 s, total: 0.36 s
    Wall time: 0.36 s
Appending data at the end of ctable objects is also very fast:
    timeit rc.append(row)
    100000 loops, best of 3: 13.1 µs per loop
Compare this with an append to a structured array:
    timeit np.concatenate((r2, row))
    100 loops, best of 3: 6.84 ms per loop
Unfortunately you cannot do the full range of operations supported by structured arrays with ctables, and a ctable object is rather meant to be used as an efficient, compressed container for structures in memory:
    r2[2]
    (2, 4, 8.0)
    rc[2]
    (2, 4, 8.0)
    r2['f1']
    array([0, 1, 4, ..., 1, 1, 1])
    rc['f1']
    carray((1452223,), int64)
      nbytes: 11.08 MB; cbytes: 1.62 MB; ratio: 6.85
      cparams := cparams(clevel=5, shuffle=True)
    [0, 1, 4, ..., 1, 1, 1]
But still, you can do fun things like complex queries:
    [r for r in rc.getif("(f0<10)&(f2>4)", ["__nrow__", "f1"])]
    [(2, 4), (3, 9), (4, 16), (5, 25), (6, 36), (7, 49), (8, 64), (9, 81), (1041112, 1)]
The queries are also very fast (both Numexpr and Blosc are used under the hood):
    timeit [r for r in rc.getif("(f0<10)&(f2>4)")]
    10 loops, best of 3: 58.6 ms per loop
    timeit r2[(r2['f0']<10)&(r2['f2']>4)]
    10 loops, best of 3: 28 ms per loop
So, queries on ctables are only 2x slower than using plain structured arrays -- of course, the secret goal is to make this sort of query actually faster than using structured arrays :)

I still need to finish the docs, but I plan to release carray 0.3 later this week.

Cheers,

--
Francesc Alted
On 12/5/10 7:56 PM, Wai Yip Tung wrote:
I'm fairly new to numpy and I'm trying to figure out the right way to do things. Continuing on my question about using recarray as a relation.
note that recarrays (or structured arrays; AFAIK the difference is attribute access only -- I don't use recarrays) are far more static than a database table. So you may really want to use a database, or maybe pytables. Or maybe even just stick with lists. But if you are keeping things in memory, you should be able to do what you want.
In [339]: arr = np.array([
   .....:     (1, 2.2, 0.0),
   .....:     (3, 4.5, 0.0)
   .....:     ],
   .....:     dtype=[
   .....:         ('unit',int),
   .....:         ('price',float),
   .....:         ('amount',float),
   .....:     ]
   .....: )
In [340]: data = arr.view(recarray)
One of the most common things I want to do is append rows to data.
numpy arrays do not naturally support appending, as you have discovered.
I think concatenate() might be the method.
yes.
But I get a problem:
In [342]: np.concatenate((data0,[1,9.0,9.0]))
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
c:\Python26\Lib\site-packages\numpy\<ipython console> in <module>()
TypeError: expected a readable buffer object
concatenate expects two arrays to be joined. If you pass in something that can easily be turned into an array, it will work, but a tuple can be converted to multiple types of arrays, so it doesn't know what to do. So you need to re-construct the second array with the same dtype (dt here is the dtype of arr):

    a2 = np.array( [(3,5.5, 3)], dtype=dt)
    arr = np.concatenate( (arr, a2) )
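A self-contained version of the step above might look like the following sketch. It assumes the array from the original post and takes the dtype from `arr.dtype`; the new row's values are the ones the original poster tried to append.

```python
import numpy as np

arr = np.array([(1, 2.2, 0.0), (3, 4.5, 0.0)],
               dtype=[('unit', int), ('price', float), ('amount', float)])

# Wrap the new record in a list of one tuple and reuse the existing dtype,
# so both inputs to concatenate have the same structured type.
new_row = np.array([(1, 9.0, 9.0)], dtype=arr.dtype)
arr2 = np.concatenate((arr, new_row))

# concatenate returns a plain structured array here; view it as a recarray
# if attribute access is wanted.
data = arr2.view(np.recarray)
```

Note that the record has to be a tuple inside a list: a bare list such as `[1, 9.0, 9.0]` is read as three separate scalar elements rather than one record.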
In [343]: data.amount = data.unit * data.price
yup
But sometimes it may require me to add a new column that does not already exist, e.g.:
In [344]: data.discount_price = data.price * 0.9
How can I add a new column?
you can't. what you need to do is create a new array with a new dtype that includes the new field.

The trick is that numpy only supports homogeneous arrays -- every item is the same data type. So when you create a struct array like the one above, numpy does not define it as a 2-d table, but rather as a 1-d array, each element of which is a structure.

so you need to do something like:

    # create a new array (dt2 is the old dtype extended with the new 'discount_price' field)
    data2 = np.zeros(len(data), dtype=dt2)

    # fill the array:
    for field_name in dt.fields.keys():
        data2[field_name] = data[field_name]

    # now some calculations:
    data2['discount_price'] = data2['price'] * 0.9

I don't know of a way to avoid that loop when filling the array.

Better yet -- anticipate your needs and create the array with all the fields you need in the first place.

You can see that ndarrays are pretty static -- struct arrays can be useful data storage, but are not very suitable when things are changing much.

You could write a class that wraps an ndarray and supports what you need better -- it could be a pretty useful general-purpose class, too. I've got one that handles the appending part, but nothing for adding new fields.

Here's appending with my class:

    data3 = accumulator.accumulator(dtype = dt2)
    data3.append((1, 2.2, 0.0, 0.0))
    data3.append((3, 4.5, 0.0, 0.0))
    data3.append((2, 1.2, 0.0, 0.0))
    data3.append((5, 4.2, 0.0, 0.0))
    print repr(data3)

    # convert to regular array for calculations:
    data3 = np.array(data3)

    # now some calculations:
    data3['discount_price'] = data3['price'] * 0.9

You wouldn't have to convert to a regular array, except that I haven't written the code to support field access yet -- I don't think it would be too hard, though.

I've enclosed some test code, and my accumulator class, in case you find it useful.

-Chris
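The message does not show how `dt2` is built. One possible way, sketched here as an assumption, is to extend the existing dtype's field list with `dtype.descr` and then copy the old fields across by name:

```python
import numpy as np

data = np.array([(1, 2.2, 0.0), (3, 4.5, 0.0)],
                dtype=[('unit', int), ('price', float), ('amount', float)])

# Extend the existing field list with the new field (plays the role of dt2 above).
dt2 = np.dtype(data.dtype.descr + [('discount_price', float)])

# Allocate the wider array and copy the existing fields over by name.
data2 = np.zeros(len(data), dtype=dt2)
for name in data.dtype.names:
    data2[name] = data[name]

# The new field can now be filled from a calculation.
data2['discount_price'] = data2['price'] * 0.9
```

The loop runs once per field, not once per row, so its cost is dominated by the column copies themselves.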
On Mon, Dec 6, 2010 at 12:26 PM, Christopher Barker <Chris.Barker@noaa.gov> wrote:
numpy.lib.recfunctions has a function for easily adding new columns. Of course, it really returns a new recarray rather than adding the column to the existing one.

Appending records to such an array, however, is a different story, and you have to do something like you demonstrated above.

Ben Root
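The message does not name the function. A likely candidate is `append_fields` (or its recarray variant `rec_append_fields`), so the sketch below assumes that is what is meant and applies it to the array from the original post.

```python
import numpy as np
import numpy.lib.recfunctions as nprf

data = np.array([(1, 2.2, 0.0), (3, 4.5, 0.0)],
                dtype=[('unit', int), ('price', float), ('amount', float)])

# append_fields returns a NEW array with the extra field; the input is unchanged.
# usemask=False asks for a plain ndarray instead of a masked array.
data2 = nprf.append_fields(data, 'discount_price', data['price'] * 0.9,
                           usemask=False)

# rec_append_fields does the same but returns a recarray (attribute access).
rec = nprf.rec_append_fields(data, 'discount_price', data['price'] * 0.9)
```

Because a new array is allocated each time, this suits occasional column additions rather than use inside a tight loop.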
On 12/6/10 11:00 AM, Benjamin Root wrote:
numpy.lib.recfunctions has a method for easily adding new columns.
cool! There is a lot of other nifty-looking stuff in there too. The OP should really take a look.

And maybe an appending function is in order, too.

-Chris
participants (5)
- Benjamin Root
- Christopher Barker
- Francesc Alted
- Skipper Seabold
- Wai Yip Tung