Nicely done Travis. Working code is always better than theory. I copied your
interface and used the brute-force, non-numpy approach to construct the
pivot table. On the one hand, it doesn't preserve the order in which the entries
are discovered, as the original does. On the other hand, it's about 40%
faster for large files on my machine (see pivot2), probably because you
don't have to loop through the data so many times. You can get further
improvements if you know the operation in advance, as shown in pivotsum,
although this won't work on median AFAIK.
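(pivot2 and pivotsum themselves aren't reproduced in this message; as a rough
sketch of the single-pass, dictionary-based idea described above — the name and
interface here are my own, not the posted module's:)

```python
import numpy as np
from collections import defaultdict

def pivot2(rows, cols, values, func):
    """Single-pass pivot sketch: bucket values by (row, col) key in one
    loop over the data, then aggregate each bucket with func."""
    buckets = defaultdict(list)
    for r, c, v in zip(rows, cols, values):
        buckets[(r, c)].append(v)
    # Sorted labels (this is where the original discovery order is lost).
    row_labels = sorted(set(rows))
    col_labels = sorted(set(cols))
    table = np.full((len(row_labels), len(col_labels)), np.nan)
    for (r, c), vals in buckets.items():
        table[row_labels.index(r), col_labels.index(c)] = func(vals)
    return row_labels, col_labels, table
```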
regards,
-tim
On 8/1/07, Travis Vaught wrote:
Greetings,
Speaking of brute force... I've attached a rather ugly module that lets you do things with a pretty simple interface (session shown below). I haven't fully tested the performance, but a million records with 5 fields takes about 11 seconds on my Mac to do a 'mean'. I'm not sure what your performance considerations are, but this may be useful. Record arrays are really nice if they make sense for your data.
Travis
(from an ipython command prompt)
In [1]: import testpivot as p
In [2]: a = p.sample_data()
In [3]: a
Out[3]:
recarray([('ACorp', 'Region 1', 'Q1', 20000.0), ('ACorp', 'Region 1', 'Q2', 22000.0),
          ('ACorp', 'Region 1', 'Q3', 21000.0), ('ACorp', 'Region 1', 'Q4', 26000.0),
          ('ACorp', 'Region 2', 'Q1', 23000.0), ('ACorp', 'Region 2', 'Q2', 20000.0),
          ('ACorp', 'Region 2', 'Q3', 22000.0), ('ACorp', 'Region 2', 'Q4', 21000.0),
          ('ACorp', 'Region 3', 'Q1', 26000.0), ('ACorp', 'Region 3', 'Q2', 23000.0),
          ('ACorp', 'Region 3', 'Q3', 29000.0), ('ACorp', 'Region 3', 'Q4', 27000.0),
          ('BCorp', 'Region 1', 'Q1', 20000.0), ('BCorp', 'Region 1', 'Q2', 20000.0),
          ('BCorp', 'Region 1', 'Q3', 24000.0), ('BCorp', 'Region 1', 'Q4', 24000.0),
          ('BCorp', 'Region 2', 'Q1', 21000.0), ('BCorp', 'Region 2', 'Q2', 21000.0),
          ('BCorp', 'Region 2', 'Q3', 22000.0), ('BCorp', 'Region 2', 'Q4', 29000.0),
          ('BCorp', 'Region 3', 'Q1', 28000.0), ('BCorp', 'Region 3', 'Q2', 25000.0),
          ('BCorp', 'Region 3', 'Q3', 22000.0), ('BCorp', 'Region 3', 'Q4', 21000.0)],
         dtype=[('company', '|S5'), ('region', '|S8'), ('quarter', '|S2'), ('income', '
In [4]: p.pivot(a, 'company', 'region', 'income', p.psum)
######## Summary by company and region ##########
cols:['ACorp' 'BCorp']
rows:['Region 1' 'Region 2' 'Region 3']
[[  89000.   88000.]
 [  86000.   93000.]
 [ 105000.   96000.]]
In [5]: p.pivot(a, 'company', 'quarter', 'income', p.psum)
######## Summary by company and quarter ##########
cols:['ACorp' 'BCorp']
rows:['Q1' 'Q2' 'Q3' 'Q4']
[[ 69000.  69000.]
 [ 65000.  66000.]
 [ 72000.  68000.]
 [ 74000.  74000.]]
In [6]: p.pivot(a, 'company', 'quarter', 'income', p.pmean)
######## Summary by company and quarter ##########
cols:['ACorp' 'BCorp']
rows:['Q1' 'Q2' 'Q3' 'Q4']
[[ 23000.          23000.        ]
 [ 21666.66666667  22000.        ]
 [ 24000.          22666.66666667]
 [ 24666.66666667  24666.66666667]]
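(The attached testpivot module isn't included here; a minimal sketch of a pivot
over a record array with the interface shown above — mask-per-cell, so simpler
but slower than the real module — might look like:)

```python
import numpy as np

def pivot(a, colfield, rowfield, valfield, func):
    """Pivot a record array: one cell per (row level, col level) pair.
    Hypothetical reconstruction, not the code Travis attached."""
    cols = np.unique(a[colfield])
    rows = np.unique(a[rowfield])
    out = np.zeros((len(rows), len(cols)))
    for i, r in enumerate(rows):
        for j, c in enumerate(cols):
            # Boolean mask selecting the records in this cell.
            mask = (a[rowfield] == r) & (a[colfield] == c)
            out[i, j] = func(a[valfield][mask])
    return rows, cols, out
```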
On Aug 1, 2007, at 2:02 PM, Bruce Southey wrote:
Hi, The hard part is knowing which aggregate function you want. So a hard way, even after cheating, to take the data provided is given below. (The Numpy Example List was very useful, especially on the where function!)
I tried to be a little generic so you can replace the sum with any suitable function, and probably the array type as well. Of course it is not complete, because you still need to know the levels of the 'rows' and 'columns', and it is also not efficient as it has loops.
Bruce
from numpy import *

A = array([[1, 1, 10],
           [1, 1, 20],
           [1, 2, 30],
           [2, 1, 40],
           [2, 2, 50],
           [2, 2, 60]])
C = zeros((2, 2))

for i in range(2):
    crit1 = (A[:, 0] == 1 + i)
    subA = A[crit1, 1:]
    for j in range(2):
        crit2 = (subA[:, 0] == 1 + j)
        subB = subA[crit2, 1:]
        C[i, j] = subB.sum()

print C
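(Addressing the two caveats above — the hard-coded levels and the fixed sum —
here is a sketch that discovers the levels with np.unique and takes the
aggregate as a parameter; the name pivot_agg is mine, not from the thread:)

```python
import numpy as np

def pivot_agg(A, func):
    """Pivot column 2 of A by the levels of columns 0 and 1,
    aggregating each cell with func (sum, mean, median, ...)."""
    rows = np.unique(A[:, 0])
    cols = np.unique(A[:, 1])
    C = np.zeros((len(rows), len(cols)))
    for i, r in enumerate(rows):
        for j, c in enumerate(cols):
            vals = A[(A[:, 0] == r) & (A[:, 1] == c), 2]
            # Leave empty cells at 0.0 rather than calling func on nothing.
            C[i, j] = func(vals) if len(vals) else 0.0
    return C
```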
On 7/30/07, Geoffrey Zhu wrote:
Hi Everyone,
I am wondering what is the best (and fastest) way to build a pivot table, aside from the 'brute force way'?
I want to transform a numpy array into a pivot table. For example, if I have a numpy array like below:
Region     Date     # of Units
------     ----     ----------
East       1/1      10
East       1/1      20
East       1/2      30
West       1/1      40
West       1/2      50
West       1/2      60
I want to transform this into the following table, where f() is a given aggregate function:
           Date
Region     1/1         1/2
------     --------    --------
East       f(10,20)    f(30)
West       f(40)       f(50,60)
I can regroup them into 'sets' and do it the brute force way, but that is kind of slow to execute. Does anyone know a better way?
Thanks, Geoffrey _______________________________________________ Numpy-discussion mailing list Numpy-discussion@scipy.org http://projects.scipy.org/mailman/listinfo/numpy-discussion
--
. __
.   |-\
.
. tim.hochberg@ieee.org