How to implement a 'pivot table'?
Hi Everyone,

I am wondering what is the best (and fastest) way to build a pivot table, aside from the 'brute force way'. I want to transform a numpy array into a pivot table. For example, if I have a numpy array like below:

```
Region     Date   # of Units
--------   ----   ----------
East       1/1        10
East       1/1        20
East       1/2        30
West       1/1        40
West       1/2        50
West       1/2        60
```

I want to transform this into the following table, where f() is a given aggregate function:

```
              Date
Region     1/1        1/2
--------   --------   --------
East       f(10,20)   f(30)
West       f(40)      f(50,60)
```

I can regroup them into 'sets' and do it the brute force way, but that is kind of slow to execute. Does anyone know a better way?

Thanks,
Geoffrey
On 7/30/07, Geoffrey Zhu <zyzhu2000@gmail.com> wrote:
What's the brute force way? It's easier to offer an improved suggestion if we know what we're trying to beat.

> I want to transform a numpy array into a pivot table. For example, if [...]
I would use a Python dictionary to assemble lists of values, keyed off (region, date) tuples. In outline:

```python
map = {}          # (region, date) -> list of unit values for that cell
dates = set()
regions = set()
for (region, date, units) in data:
    dates.add(date)
    regions.add(region)
    key = (region, date)
    if key not in map:
        map[key] = []
    map[key].append(units)
```

Once you have map, regions and dates, you can trivially make a table as above. The details will depend on what format you want the table to have, but it should be easy to do.
--
tim.hochberg@ieee.org
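As a minimal sketch (not from the original mail) of the "make a table" step the outline leaves open, assuming the map, regions and dates built above and an arbitrary aggregate f:

```python
import numpy as np

def make_table(cell_map, regions, dates, f):
    # cell_map is the dict called `map` in the outline above; rows are
    # regions, columns are dates, and cells with no data are left as NaN.
    regions = sorted(regions)
    dates = sorted(dates)
    table = np.full((len(regions), len(dates)), np.nan)
    for i, region in enumerate(regions):
        for j, date in enumerate(dates):
            values = cell_map.get((region, date))
            if values:
                table[i, j] = f(values)
    return table
```

For example, `make_table(map, regions, dates, sum)` would reproduce the layout from the first message with sums as the aggregate.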
Hi Timothy,

On 7/30/07, Timothy Hochberg <tim.hochberg@ieee.org> wrote:
The 'brute-force' way is basically what you suggested -- looping through all the records and building a two-way hash-table of the data. The problem with the 'brute-force' approach is that it does not take advantage of numpy's facilities and can be slow. If only there were some built-in mechanism in numpy to handle this.

The other thing I am not sure about: in your map object above, do I append the row number (an index into the numpy array), or do I append the row object itself (such as data[r])?

Thanks,
Geoffrey
On 7/30/07, Geoffrey Zhu <zyzhu2000@gmail.com> wrote:
> [SNIP] The 'brute-force' way is basically what you suggested -- looping [...]
I'm curious; have you tried this and found it slow, or is this a hunch based on the reputed slowness of Python?

Algorithmically, you can't do much better: the dictionary and set operations are O(1), so the whole operation is O(N), and you won't do any better than that, order-wise. What you're left with is trying to reduce constant factors.

There are various ways one might go about reducing constant factors, but they depend on the details of the problem. For example, if the dates are dense and you are going to parse them anyway, you could replace the hash with a table that you index into with the date as an integer. I'm not sure that you are going to do a lot better than the brute force algorithm in the generic case, though.

-tim

> The other thing I am not sure is in your map object above, do I append [...]
--
tim.hochberg@ieee.org
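A rough sketch of the "replace the hash with a table" idea, assuming region and date have already been encoded as small integers (day-of-year, say); the function and variable names are made up for the illustration and are not from the thread:

```python
def pivot_dense(records, n_regions, n_days):
    # records yields (region_idx, day_idx, units) triples with both indices
    # already encoded as small integers.  Each cell is a plain list slot,
    # so the per-record cost is two index lookups instead of a dict lookup.
    cells = [[[] for _ in range(n_days)] for _ in range(n_regions)]
    for region_idx, day_idx, units in records:
        cells[region_idx][day_idx].append(units)
    return cells  # apply any aggregate f() to each cells[i][j] afterwards
```

Because each cell still holds the raw values, this stays generic: any aggregate, including median, can be applied afterwards.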
Hi Timothy,

On 7/31/07, Timothy Hochberg <tim.hochberg@ieee.org> wrote:
I agree that algorithmically you can't do much better. It is basically a C vs Python thing. One advantage of numpy is that you can do vectorized operations at the speed of C. With this algorithm, we have to process the data element by element, and the speed advantage of numpy is lost. Since the data has to be stored in Python sets and maps, I imagine the storage advantage is also lost. I was in fact looking for some implementation of this algorithm in numpy (and so in C) that does exactly this, or some implementation that can leverage the fast numpy routines.

I haven't tried it with the real data load yet. I know the number of records will be huge, and it is just a hunch that it will be slow.
Unfortunately it has to be something generic.

Thanks a lot for your help.
Geoffrey
Generating these types of summary statistics is very common in SAS. In SAS you would set up a sequence of procedures: first sort by the variables of interest, and then calculate the metrics of interest by the combination of values. In numpy/scipy this might be something like:

1. Sort by date and region
2. Determine the first and last index of the blocks
3. Calculate mean, sum, etc. for each of the blocks

Based on the earlier arguments I am wondering, however, if this would provide any speed up. I am very interested in this issue, so if you implement some general procedures and perhaps speed tests please share them with the list.

Best,
Vincent

On Jul 31, 10:00 am, "Geoffrey Zhu" <zyzhu2...@gmail.com> wrote: [...]
I do a lot of this kind of thing in SAS. I don't like SAS that much, so it would be great to have functionality like this for numpy recarrays. To transplant the approach that SAS takes to a numpy setting, you'd have something like the following steps:

1. Sort the data by date and region
2. Determine the indices for the blocks (e.g., East, 1/1)
3. Calculate the summary stats per block

SAS is very efficient at these types of operations, I believe. Since it assumes that the data is sorted, and throws an error if the data is not sorted appropriately, I assume the indexing can be more efficient. However, given the earlier comments I wonder if this approach would enhance performance. I would be very interested to see what you come up with, so please post some of the code and/or timing tests to the list if possible.

Best,
Vincent
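A minimal numpy sketch of the sort-then-block approach described in the two messages above, assuming the group keys have already been encoded as integer arrays region_codes and date_codes alongside a values array (the names and the pre-encoding are assumptions for the example, not from the original mails):

```python
import numpy as np

def blockwise_aggregate(region_codes, date_codes, values, agg=np.sum):
    # 1. Sort by (region, date); lexsort treats the last key as primary.
    order = np.lexsort((date_codes, region_codes))
    r, d, v = region_codes[order], date_codes[order], values[order]

    # 2. Find the first index of each (region, date) block.
    new_block = np.ones(len(v), dtype=bool)
    new_block[1:] = (r[1:] != r[:-1]) | (d[1:] != d[:-1])
    starts = np.flatnonzero(new_block)
    ends = np.append(starts[1:], len(v))

    # 3. Apply the aggregate to each block.
    keys = list(zip(r[starts], d[starts]))
    stats = [agg(v[s:e]) for s, e in zip(starts, ends)]
    return keys, stats
```

Only the final per-block aggregate runs as a Python loop; the sort and the block-boundary detection are vectorized.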
Hi,

The hard part is knowing which aggregate function you want. So a hard way, even after cheating, to handle the data provided is given below. (The Numpy Example List was very useful, especially on the where function!) I tried to be a little generic, so you can replace the sum by any suitable function, and probably the array type as well. Of course it is not complete, because you still need to know the levels of the 'rows' and 'columns', and it is also not efficient as it has loops.

Bruce

```python
from numpy import *

# Columns: region code, date code, units
A = array([[1, 1, 10],
           [1, 1, 20],
           [1, 2, 30],
           [2, 1, 40],
           [2, 2, 50],
           [2, 2, 60]])

C = zeros((2, 2))
for i in range(2):
    crit1 = (A[:, 0] == 1 + i)          # rows for region level i+1
    subA = A[crit1, 1:]
    for j in range(2):
        crit2 = (subA[:, 0] == 1 + j)   # rows for date level j+1
        subB = subA[crit2, 1:]
        C[i, j] = subB.sum()
print(C)
```

On 7/30/07, Geoffrey Zhu <zyzhu2000@gmail.com> wrote: [...]
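To address the "you still need to know the levels" caveat, np.unique can discover them, which makes the snippet a little more generic (a small variation, not part of the original mail):

```python
import numpy as np

A = np.array([[1, 1, 10],
              [1, 1, 20],
              [1, 2, 30],
              [2, 1, 40],
              [2, 2, 50],
              [2, 2, 60]])

rows = np.unique(A[:, 0])           # levels of the row variable (region)
cols = np.unique(A[:, 1])           # levels of the column variable (date)
C = np.zeros((len(rows), len(cols)))
for i, rlev in enumerate(rows):
    for j, clev in enumerate(cols):
        mask = (A[:, 0] == rlev) & (A[:, 1] == clev)
        C[i, j] = A[mask, 2].sum()  # swap .sum() for any other aggregate
print(C)
```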
Greetings,

Speaking of brute force... I've attached a rather ugly module that lets you do things with a pretty simple interface (session shown below). I haven't fully tested the performance, but a million records with 5 fields takes about 11 seconds on my Mac to do a 'mean'. I'm not sure what your performance considerations are, but this may be useful. Record arrays are really nice if they make sense for your data.

Travis

(from an ipython command prompt)

```
In [1]: import testpivot as p

In [2]: a = p.sample_data()

In [3]: a
Out[3]:
recarray([('ACorp', 'Region 1', 'Q1', 20000.0),
          ('ACorp', 'Region 1', 'Q2', 22000.0),
          ('ACorp', 'Region 1', 'Q3', 21000.0),
          ('ACorp', 'Region 1', 'Q4', 26000.0),
          ('ACorp', 'Region 2', 'Q1', 23000.0),
          ('ACorp', 'Region 2', 'Q2', 20000.0),
          ('ACorp', 'Region 2', 'Q3', 22000.0),
          ('ACorp', 'Region 2', 'Q4', 21000.0),
          ('ACorp', 'Region 3', 'Q1', 26000.0),
          ('ACorp', 'Region 3', 'Q2', 23000.0),
          ('ACorp', 'Region 3', 'Q3', 29000.0),
          ('ACorp', 'Region 3', 'Q4', 27000.0),
          ('BCorp', 'Region 1', 'Q1', 20000.0),
          ('BCorp', 'Region 1', 'Q2', 20000.0),
          ('BCorp', 'Region 1', 'Q3', 24000.0),
          ('BCorp', 'Region 1', 'Q4', 24000.0),
          ('BCorp', 'Region 2', 'Q1', 21000.0),
          ('BCorp', 'Region 2', 'Q2', 21000.0),
          ('BCorp', 'Region 2', 'Q3', 22000.0),
          ('BCorp', 'Region 2', 'Q4', 29000.0),
          ('BCorp', 'Region 3', 'Q1', 28000.0),
          ('BCorp', 'Region 3', 'Q2', 25000.0),
          ('BCorp', 'Region 3', 'Q3', 22000.0),
          ('BCorp', 'Region 3', 'Q4', 21000.0)],
         dtype=[('company', '|S5'), ('region', '|S8'), ('quarter', '|S2'), ('income', '<f8')])

In [4]: p.pivot(a, 'company', 'region', 'income', p.psum)
######## Summary by company and region ##########
cols:['ACorp' 'BCorp']
rows:['Region 1' 'Region 2' 'Region 3']
[[  89000.   88000.]
 [  86000.   93000.]
 [ 105000.   96000.]]

In [5]: p.pivot(a, 'company', 'quarter', 'income', p.psum)
######## Summary by company and quarter ##########
cols:['ACorp' 'BCorp']
rows:['Q1' 'Q2' 'Q3' 'Q4']
[[ 69000.  69000.]
 [ 65000.  66000.]
 [ 72000.  68000.]
 [ 74000.  74000.]]

In [6]: p.pivot(a, 'company', 'quarter', 'income', p.pmean)
######## Summary by company and quarter ##########
cols:['ACorp' 'BCorp']
rows:['Q1' 'Q2' 'Q3' 'Q4']
[[ 23000.          23000.        ]
 [ 21666.66666667  22000.        ]
 [ 24000.          22666.66666667]
 [ 24666.66666667  24666.66666667]]
```

On Aug 1, 2007, at 2:02 PM, Bruce Southey wrote: [...]
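The attached testpivot module is not reproduced in the archive. Purely as a guess at what a function with that interface might look like (psum, pmean and the field handling below are assumptions inferred from the session output, not Travis's actual code):

```python
import numpy as np

def psum(values):
    return np.sum(values)

def pmean(values):
    return np.mean(values)

def pivot(arr, colfield, rowfield, valfield, agg):
    # Cross-tabulate a record array: one row per level of `rowfield`,
    # one column per level of `colfield`, aggregating `valfield` with `agg`.
    cols = np.unique(arr[colfield])
    rows = np.unique(arr[rowfield])
    table = np.zeros((len(rows), len(cols)))
    for i, r in enumerate(rows):
        for j, c in enumerate(cols):
            mask = (arr[rowfield] == r) & (arr[colfield] == c)
            table[i, j] = agg(arr[valfield][mask])
    print("######## Summary by %s and %s ##########" % (colfield, rowfield))
    print("cols:%s" % cols)
    print("rows:%s" % rows)
    print(table)
```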
What is ugly about the module? I like it! What do you mean about recarrays? Do you think they are not appropriate for this type of thing? When I get some time I'll run some tests versus SAS for the same operations and do a speed comparison.

Question: would there be an easy way to merge the summary stats back into the recarray?

Best,
Vincent

On Aug 1, 11:22 pm, Travis Vaught <tra...@enthought.com> wrote: [...]
Nicely done Travis. Working code is always better than theory.

I copied your interface and used the brute-force, non-numpy approach to construct the pivot table. On the one hand, unlike the original, it doesn't preserve the order in which the entries are discovered. On the other hand, it's about 40% faster for large files on my machine (see pivot2), probably because you don't have to loop through the data so many times. You can get further improvements if you know the operation in advance, as shown in pivotsum, although this won't work for median AFAIK.

regards,

-tim

On 8/1/07, Travis Vaught <travis@enthought.com> wrote: [...]
--
tim.hochberg@ieee.org
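A sketch of the kind of specialization meant by "knowing the operation in advance" (this is an illustration, not Tim's actual pivotsum): when the aggregate is a sum, each cell can carry a single running total instead of a list of values, which is also why an order statistic such as the median cannot be handled this way.

```python
def running_sum_cells(records):
    # records yields (row_key, col_key, value) triples; keep one running
    # total per (row, col) cell instead of accumulating every value.
    totals = {}
    for row, col, value in records:
        key = (row, col)
        totals[key] = totals.get(key, 0.0) + value
    return totals
```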
participants (5)

- Bruce Southey
- Geoffrey Zhu
- Timothy Hochberg
- Travis Vaught
- Vincent