On Monday 06 December 2010 22:00:29, Wai Yip Tung wrote:
Thank you for the quick response and Christopher's explanation on the design background.
All my tables fit in memory. I want to explore the data interactively, and a relational database does not provide me a lot of value.
I was rolling my own library before I came to numpy. Then I found numpy's universal functions awesome and a really good fit for what I want to do. Now I just need to find out how to add a row, which is easy in Python. It is OK if it rebuilds the array when I add a column, which should happen infrequently. But if adding a row builds a new array, this will lead to O(n^2) complexity. In any case, I will explore the recfunctions.
If you want a container with a better complexity for adding columns than O(n^2), you may want to have a look at the ctable object in the carray package:
https://github.com/FrancescAlted/carray
carray is about providing compressed, in-memory data containers for both homogeneous (arrays) and heterogeneous data (structured arrays). Here is an example of use:
import numpy as np
import carray as ca
NR = 1000*1000
r = np.fromiter(((i,i*i) for i in xrange(NR)), dtype="i4,i8")
new_field = np.arange(NR, dtype='f8')**3
rc = ca.ctable(r)
rc
ctable((1000000,), [('f0', '
that is, only 30 ms for appending a column. This is basically the time to copy (and compress) the data (i.e. O(n)). If you append an already compressed column, the cost of adding it is O(1):
r = np.fromiter(((i,i*i) for i in xrange(NR)), dtype="i4,i8")
rc = ca.ctable(r)
cnew_field = ca.carray(np.arange(NR, dtype='f8')**3)
time rc.addcol(cnew_field, "f2")
CPU times: user 0.00 s, sys: 0.00 s, total: 0.00 s
Wall time: 0.00 s
On the other hand, using plain structured arrays is considerably more costly:
import numpy.lib.recfunctions as nprf
time r2 = nprf.rec_append_fields(r, 'f2', new_field, 'f8')
CPU times: user 0.34 s, sys: 0.02 s, total: 0.36 s
Wall time: 0.36 s
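For reference, here is a rough sketch of why the structured-array route costs a full copy: the new field changes the record layout, so a new array has to be allocated and every existing field copied into it. The helper name add_field is made up for illustration (it is not part of NumPy), and the array is kept small on purpose.

```python
import numpy as np
import numpy.lib.recfunctions as nprf

NR = 1000  # small on purpose; the timings above used 1000*1000 rows

# Same layout as in the session: f0 = i (int32), f1 = i*i (int64)
r = np.empty(NR, dtype="i4,i8")
r["f0"] = np.arange(NR)
r["f1"] = np.arange(NR, dtype="i8") ** 2
new_field = np.arange(NR, dtype="f8") ** 3


def add_field(arr, name, data):
    """Hypothetical helper: return a copy of `arr` with one extra field."""
    data = np.asarray(data)
    # Build the widened record layout: old fields plus the new one.
    fields = [(n, arr.dtype[n]) for n in arr.dtype.names]
    fields.append((name, data.dtype))
    out = np.empty(arr.shape, dtype=np.dtype(fields))
    # Every existing field is copied, hence O(n) work per added column.
    for n in arr.dtype.names:
        out[n] = arr[n]
    out[name] = data
    return out


manual = add_field(r, "f2", new_field)
r2 = nprf.rec_append_fields(r, "f2", new_field, "f8")
print(manual.dtype.names, r2.dtype.names)
```

Both results carry the fields f0, f1 and f2; the difference from ctable is only that the existing columns have to be rewritten each time.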
Appending data at the end of ctable objects is also very fast:
timeit rc.append(row)
100000 loops, best of 3: 13.1 µs per loop
Compare this with an append to a structured array:
timeit np.concatenate((r2, row))
100 loops, best of 3: 6.84 ms per loop
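The gap comes from np.concatenate allocating a new array and copying all existing rows on every call, which is what makes a loop of appends O(n^2) overall. If you want to stay with plain NumPy, one common workaround is to over-allocate and double the capacity whenever the buffer fills up, so that appends are amortized O(1). A minimal sketch, where the class name GrowableTable is invented for this example and is not part of NumPy or carray:

```python
import numpy as np


class GrowableTable:
    """Hypothetical append-friendly wrapper around a structured array."""

    def __init__(self, dtype, capacity=16):
        self._buf = np.empty(capacity, dtype=dtype)
        self._n = 0  # number of rows actually in use

    def append(self, row):
        if self._n == len(self._buf):
            # Buffer full: double it, copying the used rows once.
            bigger = np.empty(2 * len(self._buf), dtype=self._buf.dtype)
            bigger[: self._n] = self._buf[: self._n]
            self._buf = bigger
        self._buf[self._n] = row  # row is a tuple matching the dtype
        self._n += 1

    def view(self):
        # A view on the used part of the buffer, no copy involved.
        return self._buf[: self._n]


t = GrowableTable("i4,i8,f8")
for i in range(1000):
    t.append((i, i * i, float(i) ** 3))
print(len(t.view()), t.view()[2])
```

The price is some unused memory at the end of the buffer, and universal functions should be applied to t.view() rather than to the raw buffer.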
Unfortunately, ctables do not support the full range of operations that structured arrays do; a ctable object is rather meant to be used as an efficient, compressed container for structures in memory:
r2[2]
(2, 4, 8.0)
rc[2]
(2, 4, 8.0)
r2['f1']
array([0, 1, 4, ..., 1, 1, 1])
rc['f1']
carray((1452223,), int64) nbytes: 11.08 MB; cbytes: 1.62 MB; ratio: 6.85
  cparams := cparams(clevel=5, shuffle=True)
[0, 1, 4, ..., 1, 1, 1]
But still, you can do fun things like complex queries:
[r for r in rc.getif("(f0<10)&(f2>4)", ["__nrow__", "f1"])]
[(2, 4), (3, 9), (4, 16), (5, 25), (6, 36), (7, 49), (8, 64), (9, 81), (1041112, 1)]
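For comparison, a query of this shape can be sketched on a plain structured array with a boolean mask, recovering the row numbers with np.flatnonzero. This is only a NumPy counterpart, not how getif works internally, and the small array built here does not contain the rows appended earlier in the session, so the trailing match shown above does not appear:

```python
import numpy as np

NR = 1000
r2 = np.empty(NR, dtype="i4,i8,f8")
r2["f0"] = np.arange(NR)
r2["f1"] = np.arange(NR, dtype="i8") ** 2
r2["f2"] = np.arange(NR, dtype="f8") ** 3

# Same condition as the getif() expression string.
mask = (r2["f0"] < 10) & (r2["f2"] > 4)
nrows = np.flatnonzero(mask)  # plays the role of "__nrow__"
result = [(int(i), int(v)) for i, v in zip(nrows, r2["f1"][nrows])]
print(result)
```

Note that the mask is a full-length temporary boolean array, whereas the ctable query takes the condition as a string and yields matching rows through an iterator.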
The queries are also very fast (both Numexpr and Blosc are used under the hood):
timeit [r for r in rc.getif("(f0<10)&(f2>4)")]
10 loops, best of 3: 58.6 ms per loop
timeit r2[(r2['f0']<10)&(r2['f2']>4)]
10 loops, best of 3: 28 ms per loop
So, queries on ctables are only about 2x slower than on plain structured arrays (58.6 ms versus 28 ms). Of course, the secret goal is to make this sort of query actually faster than with structured arrays :)
I still need to finish the docs, but I plan to release carray 0.3 later this week.
Cheers,
--
Francesc Alted