How to implement a 'pivot table'?
Hi Everyone,

I am wondering what is the best (and fast) way to build a pivot table, aside from the 'brute force way'? I want to transform a numpy array into a pivot table. For example, if I have a numpy array like below:

Region     Date    # of Units
---------- ------- ----------
East       1/1     10
East       1/1     20
East       1/2     30
West       1/1     40
West       1/2     50
West       1/2     60

I want to transform this into the following table, where f() is a given aggregate function:

            Date
Region      1/1        1/2
----------
East        f(10,20)   f(30)
West        f(40)      f(50,60)

I can regroup them into 'sets' and do it the brute force way, but that is kind of slow to execute. Does anyone know a better way?

Thanks,
Geoffrey
On 7/30/07, Geoffrey Zhu wrote:

I am wondering what is the best (and fast) way to build a pivot table aside from the 'brute force way'?

What's the brute force way? It's easier to offer an improved suggestion if we know what we're trying to beat.

[SNIP]
I would use a Python dictionary to assemble lists of values, keyed off (region, date) tuples. In outline:

map = {}
dates = set()
regions = set()
for (region, date, units) in data:
    dates.add(date)
    regions.add(region)
    key = (region, date)
    if key not in map:
        map[key] = []
    map[key].append(units)

Once you have map, regions and dates, you can trivially make a table as above. The details will depend on what format you want the table to have, but it should be easy to do.
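The table-assembly step the outline leaves as an exercise might look like this. A self-contained sketch: `make_table` and the dict-of-cells output format are my choices, not part of the original outline, and I append the `units` value into each bucket.

```python
# Self-contained sketch: build the buckets as in the outline above
# (appending the units value), then apply the aggregate f per cell.
data = [('East', '1/1', 10), ('East', '1/1', 20), ('East', '1/2', 30),
        ('West', '1/1', 40), ('West', '1/2', 50), ('West', '1/2', 60)]

map = {}           # 'map' shadows the builtin; name kept from the outline
dates = set()
regions = set()
for (region, date, units) in data:
    dates.add(date)
    regions.add(region)
    map.setdefault((region, date), []).append(units)

def make_table(map, regions, dates, f):
    # One entry per populated (region, date) cell
    return {(r, d): f(map[(r, d)])
            for r in sorted(regions) for d in sorted(dates)
            if (r, d) in map}

table = make_table(map, regions, dates, sum)
# table[('East', '1/1')] == 30, table[('West', '1/2')] == 110
```

From here, printing the two-way table is just a loop over sorted(regions) by sorted(dates).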
_______________________________________________
Numpy-discussion mailing list
Numpy-discussion@scipy.org
http://projects.scipy.org/mailman/listinfo/numpy-discussion

--
tim.hochberg@ieee.org
Hi Timothy,
On 7/30/07, Timothy Hochberg wrote:

[SNIP]

What's the brute force way? It's easier to offer an improved suggestion if we know what we're trying to beat.

[SNIP]

I would use a Python dictionary to assemble lists of values, keyed off (region, date) tuples.

[SNIP]
The 'brute-force' way is basically what you suggested -- looping through all the records and building a two-way hash table of the data.

The problem with the 'brute-force' approach is that it does not take advantage of numpy's facilities and can be slow. If only there were some built-in mechanism in numpy to handle this.

The other thing I am not sure about: in your map object above, do I append the row number to the numpy array, or do I append the row object (such as data[r])?

Thanks,
Geoffrey
On 7/30/07, Geoffrey Zhu wrote:
[SNIP]

The 'brute-force' way is basically what you suggested -- looping through all the records and building a two-way hash table of the data. The problem with the 'brute-force' approach is that it does not take advantage of numpy's facilities and can be slow. If only there were some built-in mechanism in numpy to handle this.
I'm curious; have you tried this and found it slow, or is this a hunch based on the reputed slowness of Python? Algorithmically, you can't do much better: the dictionary and set operations are O(1), so the whole operation is O(N), and you won't do any better than that, order-wise. What you're left with is trying to reduce constant factors.

There are various ways one might go about reducing constant factors, but they depend on the details of the problem. For example, if the dates are dense and you are going to parse them anyway, you could replace the hash with a table that you index into with the date as an integer. I'm not sure that you are going to do a lot better than the brute force algorithm in the generic case, though.

-tim
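Tim's dense-key idea can be sketched with np.bincount, which does a keyed sum entirely in C. The integer coding of regions and dates is my assumption, and bincount only covers sum-like aggregates:

```python
import numpy as np

# If the keys parse to small integers, (region, date) can be flattened into
# a single integer index, and np.bincount does the whole aggregation in C.
region = np.array([0, 0, 0, 1, 1, 1])      # East=0, West=1
date = np.array([0, 0, 1, 0, 1, 1])        # 1/1=0, 1/2=1
units = np.array([10, 20, 30, 40, 50, 60])

n_dates = date.max() + 1
key = region * n_dates + date              # flat index into the table
table = np.bincount(key, weights=units,
                    minlength=2 * n_dates).reshape(2, n_dates)
# table == [[30, 30], [40, 110]]
```

This trades generality (hash on arbitrary keys, arbitrary f) for speed on the special case of dense integer keys and a sum.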
Hi Timothy,
On 7/31/07, Timothy Hochberg wrote:

[SNIP]

I'm curious; have you tried this and found it slow, or is this a hunch based on the reputed slowness of Python? Algorithmically, you can't do much better: the dictionary and set operations are O(1), so the whole operation is O(N), and you won't do any better than that, order-wise. What you're left with is trying to reduce constant factors.
I agree that algorithmically you can't do much better. It is basically a C vs. Python thing. One advantage of numpy is that you can do vectorized operations at the speed of C. With this algorithm, we have to process the data element by element, and the speed advantage of numpy is lost. Since the data has to be stored in Python sets and maps, I imagine the storage advantage is also lost.

I was in fact looking for some implementation of this algorithm in numpy (and so in C) that does exactly this, or some implementation that can leverage the fast numpy routines to do it.

I haven't tried it with the real data load yet. I know the number of records will be huge, and it is just a hunch that it will be slow.
There are various ways one might go about reducing constant factors, but they depend on the details of the problem. For example, if the dates are dense and you are going to parse them anyway, you could replace the hash with a table that you index into with the date as an integer. I'm not sure that you are going to do a lot better than the brute force algorithm in the generic case, though.
Unfortunately it has to be something generic.

Thanks a lot for your help.
Geoffrey
Generating these types of summary statistics is very common in SAS. In
SAS you would set up a sequence of procedures. First sort by the
variables of interest and then calculate the metrics of interest by
the combination of values. In numpy/scipy this might be something
like:
1. Sort by date and region
2. Determine 1st and last index of the blocks
3. Calculate mean,sum, etc. for each of the blocks.
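The three steps above might look like this in numpy. A sketch: the integer coding of the labels and the use of np.lexsort/np.add.reduceat for steps 1 and 3 are my choices, not anything SAS does:

```python
import numpy as np

# Toy data: region codes, date codes, units (East=0/West=1, 1/1=0, 1/2=1)
region = np.array([0, 0, 0, 1, 1, 1])
date = np.array([0, 0, 1, 0, 1, 1])
units = np.array([10, 20, 30, 40, 50, 60])

# 1. Sort by (region, date); lexsort's last key is the primary key
order = np.lexsort((date, region))
r, d, u = region[order], date[order], units[order]

# 2. Determine the first index of each (region, date) block
keys = np.stack([r, d], axis=1)
changed = (np.diff(keys, axis=0) != 0).any(axis=1)
starts = np.concatenate(([0], np.nonzero(changed)[0] + 1))

# 3. Sum each block in one vectorized call
block_sums = np.add.reduceat(u, starts)
# block_sums == [30, 30, 40, 110] for cells (E,1/1), (E,1/2), (W,1/1), (W,1/2)
```

np.add.reduceat covers sums (and means, dividing by block lengths); a general f would still need a Python loop over the blocks.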
Based on the earlier arguments I am wondering, however, if this would
provide any speed up. I am very interested in this issue so if you
implement some general procedures and perhaps speed tests please share
them with the list.
Best,
Vincent
On Jul 31, 10:00 am, "Geoffrey Zhu" wrote:

[SNIP]
I do a lot of this kind of thing in SAS. I don't like SAS that much, so it would be great to have functionality like this for numpy recarrays. To transplant the approach that SAS takes to a numpy setting, you'd have something like the following steps:

1. Sort the data by date and region
2. Determine the indices for the blocks (e.g., East, 1/1)
3. Calculate the summary stats per block

SAS is very efficient at these types of operations, I believe. Since it assumes that the data is sorted, and throws an error if the data is not sorted appropriately, I assume the indexing can be more efficient. However, given the earlier comments I wonder if this approach would enhance performance. I would be very interested to see what you come up with, so please post some of the code and/or timing tests to the list if possible.

Best,
Vincent
Hi,
The hard part is knowing what aggregate function you want. So a hard way, even after cheating, to take the data provided is given below. (The Numpy Example List was very useful, especially on the where function!)

I tried to be a little generic so you can replace the sum by any suitable function, and probably the array type as well. Of course it is not complete, because you still need to know the levels of the 'rows' and 'columns', and it is also not efficient as it has loops.
Bruce
from numpy import *

A = array([[1, 1, 10],
           [1, 1, 20],
           [1, 2, 30],
           [2, 1, 40],
           [2, 2, 50],
           [2, 2, 60]])

C = zeros((2, 2))
for i in range(2):
    crit1 = (A[:, 0] == 1 + i)
    subA = A[crit1, 1:]
    for j in range(2):
        crit2 = (subA[:, 0] == 1 + j)
        subB = subA[crit2, 1:]
        C[i, j] = subB.sum()
print C
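Bruce's loops can be made generic over the levels by discovering them with np.unique. A sketch: the function name and the handling of empty cells are my additions:

```python
import numpy as np

def pivot_table(A, agg=np.sum):
    # Discover the 'row' and 'column' levels from the data instead of
    # hardcoding range(2); aggregate column 2 over each (row, col) cell.
    rows = np.unique(A[:, 0])
    cols = np.unique(A[:, 1])
    C = np.zeros((len(rows), len(cols)))
    for i, r in enumerate(rows):
        for j, c in enumerate(cols):
            mask = (A[:, 0] == r) & (A[:, 1] == c)
            C[i, j] = agg(A[mask, 2]) if mask.any() else 0.0
    return rows, cols, C

A = np.array([[1, 1, 10],
              [1, 1, 20],
              [1, 2, 30],
              [2, 1, 40],
              [2, 2, 50],
              [2, 2, 60]])
rows, cols, C = pivot_table(A)
# C == [[30, 30], [40, 110]]
```

This removes the hardcoded levels but keeps the nested-loop cost: one boolean pass over A per cell.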
Greetings,
Speaking of brute force... I've attached a rather ugly module that lets you do things with a pretty simple interface (session shown below). I haven't fully tested the performance, but a million records with 5 fields takes about 11 seconds on my Mac to do a 'mean'. I'm not sure what your performance considerations are, but this may be useful. Record arrays are really nice if they make sense for your data.
Travis
(from an ipython command prompt)
In [1]: import testpivot as p
In [2]: a = p.sample_data()
In [3]: a
Out[3]:
recarray([('ACorp', 'Region 1', 'Q1', 20000.0),
('ACorp', 'Region 1', 'Q2', 22000.0),
('ACorp', 'Region 1', 'Q3', 21000.0),
('ACorp', 'Region 1', 'Q4', 26000.0),
('ACorp', 'Region 2', 'Q1', 23000.0),
('ACorp', 'Region 2', 'Q2', 20000.0),
('ACorp', 'Region 2', 'Q3', 22000.0),
('ACorp', 'Region 2', 'Q4', 21000.0),
('ACorp', 'Region 3', 'Q1', 26000.0),
('ACorp', 'Region 3', 'Q2', 23000.0),
('ACorp', 'Region 3', 'Q3', 29000.0),
('ACorp', 'Region 3', 'Q4', 27000.0),
('BCorp', 'Region 1', 'Q1', 20000.0),
('BCorp', 'Region 1', 'Q2', 20000.0),
('BCorp', 'Region 1', 'Q3', 24000.0),
('BCorp', 'Region 1', 'Q4', 24000.0),
('BCorp', 'Region 2', 'Q1', 21000.0),
('BCorp', 'Region 2', 'Q2', 21000.0),
('BCorp', 'Region 2', 'Q3', 22000.0),
('BCorp', 'Region 2', 'Q4', 29000.0),
('BCorp', 'Region 3', 'Q1', 28000.0),
('BCorp', 'Region 3', 'Q2', 25000.0),
('BCorp', 'Region 3', 'Q3', 22000.0),
('BCorp', 'Region 3', 'Q4', 21000.0)],
dtype=[('company', '|S5'), ('region', '|S8'), ('quarter', '|S2'), ('income', '<f8')])

In [4]: p.pivot(a, 'company', 'region', 'income', p.psum)
######## Summary by company and region ##########
cols: ['ACorp' 'BCorp']
rows: ['Region 1' 'Region 2' 'Region 3']
[[  89000.   88000.]
 [  86000.   93000.]
 [ 105000.   96000.]]

In [5]: p.pivot(a, 'company', 'quarter', 'income', p.psum)
######## Summary by company and quarter ##########
cols: ['ACorp' 'BCorp']
rows: ['Q1' 'Q2' 'Q3' 'Q4']
[[ 69000.  69000.]
 [ 65000.  66000.]
 [ 72000.  68000.]
 [ 74000.  74000.]]

In [6]: p.pivot(a, 'company', 'quarter', 'income', p.pmean)
######## Summary by company and quarter ##########
cols: ['ACorp' 'BCorp']
rows: ['Q1' 'Q2' 'Q3' 'Q4']
[[ 23000.          23000.        ]
 [ 21666.66666667  22000.        ]
 [ 24000.          22666.66666667]
 [ 24666.66666667  24666.66666667]]

(attachment: testpivot.py, 3K)
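testpivot.py itself is attached rather than inlined in the archive. A minimal reimplementation of the `pivot(a, rowfield, colfield, valfield, func)` interface shown in the session above — my reconstruction, not the actual module; `psum`/`pmean` are just aliases here:

```python
import numpy as np

psum, pmean = np.sum, np.mean

def pivot(a, rowname, colname, valname, func):
    # Cross-tabulate a record array: unique row/col labels, then aggregate
    # the value field over each (row, col) cell.
    rows = np.unique(a[rowname])
    cols = np.unique(a[colname])
    out = np.zeros((len(rows), len(cols)))
    for i, r in enumerate(rows):
        for j, c in enumerate(cols):
            cell = a[(a[rowname] == r) & (a[colname] == c)][valname]
            out[i, j] = func(cell) if len(cell) else 0.0
    return rows, cols, out

# Small sample in the same shape as Travis's data
a = np.rec.fromrecords(
    [('ACorp', 'Region 1', 'Q1', 20000.0),
     ('ACorp', 'Region 1', 'Q2', 22000.0),
     ('BCorp', 'Region 1', 'Q1', 21000.0),
     ('BCorp', 'Region 2', 'Q2', 25000.0)],
    names='company,region,quarter,income')

rows, cols, table = pivot(a, 'company', 'region', 'income', psum)
# table == [[42000, 0], [21000, 25000]]
```

Field access by name is what makes record arrays pleasant here: the caller picks row, column, and value fields at call time.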
What is ugly about the module? I like it!

What do you mean about recarrays? Do you think they are not appropriate for this type of thing?

When I get some time I'll run some tests versus SAS for the same operations and do a speed comparison.

Question: Would there be an easy way to merge the summary stats back into the recarray?

Best,
Vincent
On Aug 1, 11:22 pm, Travis Vaught wrote:

[SNIP]
Nicely done Travis. Working code is always better than theory. I copied your interface and used the brute-force, non-numpy approach to construct the pivot table. On the one hand, it doesn't preserve the order that the entries are discovered in, as the original does. On the other hand, it's about 40% faster for large files on my machine (see pivot2), probably because you don't have to loop through the data so many times. You can get further improvements if you know the operation in advance, as shown in pivotsum, although this won't work on median AFAIK.
regards,
-tim
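Tim's attached pivot2/pivotsum code isn't in the archive. A minimal sketch of the two ideas he describes — one pass bucketing values for a generic aggregate, and running sums when the aggregate is known in advance — with all names and details being my guesses, not his actual code:

```python
import numpy as np

def pivot2(rows, cols, values, agg=np.mean):
    # One pass over the data: bucket every value under its (row, col) key
    buckets = {}
    for r, c, v in zip(rows, cols, values):
        buckets.setdefault((r, c), []).append(v)
    rkeys, ckeys = sorted(set(rows)), sorted(set(cols))
    table = np.zeros((len(rkeys), len(ckeys)))
    for i, r in enumerate(rkeys):
        for j, c in enumerate(ckeys):
            table[i, j] = agg(buckets.get((r, c), [0]))
    return rkeys, ckeys, table

def pivotsum(rows, cols, values):
    # Known aggregate: accumulate running sums instead of storing lists
    sums = {}
    for r, c, v in zip(rows, cols, values):
        sums[(r, c)] = sums.get((r, c), 0) + v
    rkeys, ckeys = sorted(set(rows)), sorted(set(cols))
    table = np.zeros((len(rkeys), len(ckeys)))
    for i, r in enumerate(rkeys):
        for j, c in enumerate(ckeys):
            table[i, j] = sums.get((r, c), 0)
    return rkeys, ckeys, table

region = ['East', 'East', 'East', 'West', 'West', 'West']
date = ['1/1', '1/1', '1/2', '1/1', '1/2', '1/2']
units = [10, 20, 30, 40, 50, 60]
# pivotsum gives the table [[30, 30], [40, 110]]
```

The pivotsum variant keeps only one number per cell, which is why a known operation is faster; median can't be accumulated this way because it needs all the values of a cell at once.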
On 8/1/07, Travis Vaught wrote:

[SNIP]
participants (5)
- Bruce Southey
- Geoffrey Zhu
- Timothy Hochberg
- Travis Vaught
- Vincent