[Numpy-discussion] How to implement a 'pivot table?'

Vincent vincent.nijs at gmail.com
Tue Jul 31 16:53:55 EDT 2007


Generating these types of summary statistics is very common in SAS,
where you would set up a sequence of procedures: first sort by the
variables of interest, then calculate the metrics of interest for each
combination of values. In numpy/scipy this might be something like:

1. Sort by date and region.
2. Determine the first and last index of each block.
3. Calculate the mean, sum, etc. for each block.
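
A minimal sketch of those three steps in numpy might look like this
(the array names and example values are made up for illustration):

import numpy as np

# Made-up example data: one record per row, held in parallel arrays.
dates   = np.array(['2007-06', '2007-06', '2007-07', '2007-07', '2007-06'])
regions = np.array(['East', 'West', 'East', 'West', 'East'])
sales   = np.array([10.0, 20.0, 30.0, 40.0, 50.0])

# 1. Sort by date and region (lexsort uses the last key as the primary key).
order = np.lexsort((regions, dates))
d, r, s = dates[order], regions[order], sales[order]

# 2. Determine the first and last index of each (date, region) block.
keys   = np.char.add(np.char.add(d, '|'), r)
starts = np.r_[0, np.nonzero(keys[1:] != keys[:-1])[0] + 1]
ends   = np.r_[starts[1:], len(keys)]

# 3. Calculate the sum and mean of each block without looping over records.
block_sums  = np.add.reduceat(s, starts)
block_means = block_sums / (ends - starts)
for key, total, mean in zip(keys[starts], block_sums, block_means):
    print(key, total, mean)

Whether this actually beats the plain-Python dictionary approach would
need to be timed on realistic data.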

Based on the earlier arguments, however, I am wondering whether this
would provide any speed-up. I am very interested in this issue, so if
you implement some general procedures and perhaps run some speed
tests, please share them with the list.

Best,

Vincent


On Jul 31, 10:00 am, "Geoffrey Zhu" <zyzhu2... at gmail.com> wrote:
> Hi Timothy,
>
> On 7/31/07, Timothy Hochberg <tim.hochb... at ieee.org> wrote:
>
> > [SNIP]
>
> > > The 'brute-force' way is basically what you suggested -- looping
> > > through all the records and building a two-way hash-table of the data.
>
> > > The problem with the 'brute-force' approach is that it does not
> > > take advantage of the facilities of numpy and can be slow. If only
> > > there were some built-in mechanism in numpy to handle this.
>
> > I'm curious; have you tried this and found it slow, or is this a hunch based
> > on the reputed slowness of Python? Algorithmically, you can't do much
> > better: the dictionary and set operations are O(1), so the whole operation
> > is O(N), and you won't do any better than that, order-wise. What you're
> > left with is trying to reduce constant factors.
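
For concreteness, a rough sketch of that brute-force approach in plain
Python (the record layout is made up for illustration):

from collections import defaultdict

# Made-up records: (date, region, value) tuples.
records = [
    ('2007-06', 'East', 10.0),
    ('2007-06', 'West', 20.0),
    ('2007-07', 'East', 30.0),
]

# One pass over the records, accumulating sums into a table keyed by
# (date, region); each dictionary lookup is O(1), so the loop is O(N).
table = defaultdict(float)
for date, region, value in records:
    table[(date, region)] += value

for (date, region), total in sorted(table.items()):
    print(date, region, total)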
>
> I agree that algorithmically you can't do much better. It is basically
> a C vs Python thing. One advantage of numpy is that you can do
> vectorized operations at the speed of C. With this algorithm, we have
> to process the data element by element and the speed advantage of
> numpy is lost. Since data has to be stored in Python sets and maps, I
> imagine the storage advantage is also lost.
>
> I was in fact looking for an implementation of this algorithm in
> numpy (and hence in C) that does exactly this, or an implementation
> that can leverage the fast numpy routines.
>
> I haven't tried it with the real data load yet. I know the number of
> records will be huge and it is just a hunch that it will be slow.
>
> > There are various ways one might go about reducing constant factors, but
> > they depend on the details of the problem. For example, if the dates are
> > dense and you are going to parse them anyway, you could replace the hash
> > with a table that you index into with the date as an integer. I'm not sure
> > that you are going to do a lot better than the brute-force algorithm in the
> > generic case, though.
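
A rough sketch of that dense-table idea, assuming the dates and regions
have already been mapped to small consecutive integers (all names and
values here are made up):

import numpy as np

# Made-up data: dates parsed to day offsets, regions coded as integers.
day_idx    = np.array([0, 0, 1, 2, 2, 1])   # e.g. days since the first date
region_idx = np.array([0, 1, 0, 1, 0, 1])   # e.g. 0 = East, 1 = West
values     = np.array([10.0, 20.0, 30.0, 40.0, 50.0, 60.0])

# Accumulate into a dense (n_days, n_regions) array instead of a hash
# table; np.add.at handles repeated (day, region) pairs correctly.
table = np.zeros((day_idx.max() + 1, region_idx.max() + 1))
np.add.at(table, (day_idx, region_idx), values)

print(table)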
>
> Unfortunately it has to be something generic.
>
> Thanks a lot for your help.
> Geoffrey
> _______________________________________________
> Numpy-discussion mailing list
> Numpy-discuss... at scipy.org
> http://projects.scipy.org/mailman/listinfo/numpy-discussion



