Re: [Numpy-discussion] Can I add rows and columns to recarray?

6 Dec 2010

      A Monday 06 December 2010 22:00:29 Wai Yip Tung escrigué:
...
Thank you for the quick response and Christopher's explanation of the
design background.
All my tables fit in memory. I want to explore the data interactively,
and a relational database does not provide me a lot of value.
I was rolling my own library before I came to numpy. Then I found
numpy's universal functions awesome and a really good fit for what I
want to do.
Now I just need to find out how to add a row, which is easy in Python.
It is OK if it rebuilds the array when I add a column, which should
happen infrequently. But if adding a row builds a new array, this will
lead to O(n^2) complexity. In any case, I will explore the
recfunctions.
If you want a container with a better complexity for adding columns  
than O(n^2), you may want to have a look at the ctable object in carray 
package:

https://github.com/FrancescAlted/carray

carray is about providing compressed, in-memory data containers for both 
homogeneous (arrays) and heterogeneous data (structured arrays).  Here 
is an example of use:
...
...
...
import numpy as np
import carray as ca
NR = 1000*1000
r = np.fromiter(((i,i*i) for i in xrange(NR)), dtype="i4,i8")
new_field = np.arange(NR, dtype='f8')**3
rc = ca.ctable(r)
rc
ctable((1000000,), [('f0', '
that is, only 30 ms for appending a column.  This is basically the time 
to copy (and compress) the data (i.e. O(n)).  If you append an already 
compressed column, the cost of adding it is O(1):
...
...
...
r = np.fromiter(((i,i*i) for i in xrange(NR)), dtype="i4,i8")
rc = ca.ctable(r)
cnew_field = ca.carray(np.arange(NR, dtype='f8')**3)
time rc.addcol(cnew_field, "f2")
CPU times: user 0.00 s, sys: 0.00 s, total: 0.00 s
Wall time: 0.00 s
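To see why attaching an already-built column can be O(1), here is a
minimal sketch of a column-wise container. ToyColumnTable and its
methods are hypothetical names made up for illustration; this is not how
carray is implemented, and compression is left out entirely.

```python
import numpy as np


class ToyColumnTable:
    """Hypothetical column-wise table: one independent array per field."""

    def __init__(self, columns):
        # Keep a reference to each column; nothing is copied here.
        self.columns = dict(columns)
        lengths = {len(col) for col in self.columns.values()}
        if len(lengths) > 1:
            raise ValueError("all columns must have the same length")
        self.nrows = lengths.pop() if lengths else 0

    def addcol(self, values, name):
        # Attaching a column only stores a reference, so the cost does
        # not depend on the number of rows already in the table.
        if name in self.columns:
            raise ValueError("column %r already exists" % name)
        if len(values) != self.nrows:
            raise ValueError("column length does not match the table")
        self.columns[name] = values

    def row(self, i):
        # A row is assembled on demand from every column.
        return tuple(col[i] for col in self.columns.values())


n = 1000
table = ToyColumnTable({
    "f0": np.arange(n, dtype="i4"),
    "f1": np.arange(n, dtype="i8") ** 2,
})
new_field = np.arange(n, dtype="f8") ** 3
table.addcol(new_field, "f2")
print(table.row(2))
```

The trade-off in this layout is that reading a single row touches every
column, while adding a column never touches the existing ones.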
On the other hand, using plain structured arrays is considerably more costly:
...
...
...
import numpy.lib.recfunctions as nprf
time r2 = nprf.rec_append_fields(r, 'f2', new_field, 'f8')
CPU times: user 0.34 s, sys: 0.02 s, total: 0.36 s
Wall time: 0.36 s
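The extra cost comes from the row-wise layout: adding a field changes
the record size, so every row has to be copied into a new buffer. Here
is a minimal NumPy-only sketch of that work. append_field is a
hypothetical helper written for illustration, not the recfunctions
implementation.

```python
import numpy as np


def append_field(base, name, values):
    """Return a new structured array with one extra field (full copy)."""
    values = np.asarray(values)
    if values.shape != base.shape:
        raise ValueError("new field must have one value per row")
    # Build the widened record layout: the old fields plus the new one.
    fields = [(f, base.dtype[f]) for f in base.dtype.names]
    new_dtype = np.dtype(fields + [(name, values.dtype)])
    out = np.empty(base.shape, dtype=new_dtype)
    # Every existing field is copied into the new buffer: O(n) work.
    for f in base.dtype.names:
        out[f] = base[f]
    out[name] = values
    return out


r = np.zeros(10, dtype="i4,i8")
r["f0"] = np.arange(10)
r["f1"] = np.arange(10) ** 2
r2 = append_field(r, "f2", np.arange(10, dtype="f8") ** 3)
print(r2.dtype.names)
print(r2[2])
```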
Appending data at the end of ctable objects is also very fast:
...
...
...
timeit rc.append(row)
100000 loops, best of 3: 13.1 µs per loop
Compare this with an append to a structured array:
...
...
...
timeit np.concatenate((r2, row))
100 loops, best of 3: 6.84 ms per loop
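The gap is there because np.concatenate allocates a new array and copies
every existing row on each call, which is where the O(n^2) behaviour for
repeated appends comes from. If only plain NumPy is available, one
common workaround is to collect rows in a Python list and build the
array once. This is a sketch under that assumption; both function names
are made up for illustration.

```python
import numpy as np

row_dtype = np.dtype("i4,i8,f8")


def grow_by_concatenate(rows):
    # Each concatenate allocates a new array and copies all rows seen
    # so far, so k appends cost O(k^2) row copies in total.
    arr = np.empty(0, dtype=row_dtype)
    for row in rows:
        arr = np.concatenate((arr, np.array([row], dtype=row_dtype)))
    return arr


def grow_by_batching(rows):
    # Python list appends are amortized O(1); the array is built once.
    pending = []
    for row in rows:
        pending.append(row)
    return np.array(pending, dtype=row_dtype)


rows = [(i, i * i, float(i) ** 3) for i in range(1000)]
a = grow_by_concatenate(rows)
b = grow_by_batching(rows)
print(a.tolist() == b.tolist())
```

Both functions produce the same array; only the amount of copying
differs.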
Unfortunately you cannot do the full range of operations supported by 
structured arrays with ctables, and a ctable object is rather meant to 
be used as an efficient, compressed container for structures in memory:
...
...
...
r2[2]
(2, 4, 8.0)
rc[2]
(2, 4, 8.0)
r2['f1']
array([0, 1, 4, ..., 1, 1, 1])
rc['f1']
carray((1452223,), int64)  nbytes: 11.08 MB; cbytes: 1.62 MB; ratio: 6.85
  cparams := cparams(clevel=5, shuffle=True)
[0, 1, 4, ..., 1, 1, 1]
But still, you can do fun things like complex queries:
...
...
...
[r for r in rc.getif("(f0<10)&(f2>4)", ["__nrow__", "f1"])]
[(2, 4),
 (3, 9),
 (4, 16),
 (5, 25),
 (6, 36),
 (7, 49),
 (8, 64),
 (9, 81),
 (1041112, 1)]
The queries are also very fast (both Numexpr and Blosc are used under 
the hood):
...
...
...
timeit [r for r in rc.getif("(f0<10)&(f2>4)")]
10 loops, best of 3: 58.6 ms per loop
timeit r2[(r2['f0']<10)&(r2['f2']>4)]
10 loops, best of 3: 28 ms per loop
So, queries on ctables are only about 2x slower than on plain structured 
arrays (58.6 ms versus 28 ms).  Of course, the secret goal is to make 
this sort of query actually faster than with structured arrays :)
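For reference, the structured-array side of that comparison can also
return row numbers and a single field, much like the getif call with
["__nrow__", "f1"]. Here is a small NumPy-only sketch on a
freshly built 1000-row table (not the table from the session above):

```python
import numpy as np

n = 1000
r2 = np.zeros(n, dtype="i4,i8,f8")
r2["f0"] = np.arange(n)
r2["f1"] = np.arange(n) ** 2
r2["f2"] = np.arange(n, dtype="f8") ** 3

# Boolean mask equivalent to the condition "(f0<10)&(f2>4)".
mask = (r2["f0"] < 10) & (r2["f2"] > 4)
# Row numbers of the matches, paired with the f1 value of each match.
nrows = np.flatnonzero(mask)
result = list(zip(nrows.tolist(), r2["f1"][mask].tolist()))
print(result)
```

Note that the mask is a full boolean array of length n, so memory use
grows with the table size even when only a few rows match.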

I still need to finish the docs, but I plan to release carray 0.3 later 
this week.

Cheers,

-- 
Francesc Alted