Re: [Numpy-discussion] How to implement a 'pivot table?'

31 Jul 2007

      Generating these types of summary statistics is very common in SAS. In
SAS you would set up a sequence of procedures. First sort by the
variables of interest and then calculate the metrics of interest by
the combination of values. In numpy/scipy this might be something
like:

1. Sort by date and region
2. Determine 1st and last index of the blocks
3. Calculate mean,sum, etc. for each of the blocks.

Based on the earlier arguments I am wondering, however, if this would
provide any speed up. I am very interested in this issue so if you
implement some general procedures and perhaps speed tests please share
them with the list.

Best,

Vincent

On Jul 31, 10:00 am, "Geoffrey Zhu"  wrote:
...
Hi Timothy,
On 7/31/07, Timothy Hochberg  wrote:
...
[SNIP]
...
...
The 'brute-force' way is basically what you suggested -- looping
through all the records and building a two-way hash-table of the data.
...
...
The problem of the brute-force' approach is that it is not taking
advantage of facilities of numpy and can be slow in speed. If only
there is some built-in mechanism in numpy to handle this.
...
I'm curious; have you tried this and found it slow, or is this a hunch based
on the reputed slowness of Python? Algorithmically, you can't do much
better: the dictionary and set operations are O(1), so the whole operation
is O(N), and you won't do any better than that, order wise. What your left
with is trying to reduce constant factors.
I agree that algorithmically you can't do much better. It is basically
a C vs Python thing. One advantage of numpy is that you can do
vectorized operations at the speed of C. With this algorithm, we have
to process the data element by element and the speed advantage of
numpy is lost. Since data has to be stored in python sets and maps, I
imagine the storage advantage is also lost.
I was in fact looking for some implemention of this algorithm in numpy
(and so C) that does exactly this, or some implementation of this
algorithm that can leverage the fast numpy routines to do this.
I haven't tried it with the real data load yet. I know the number of
records will be huge and it is just a hunch that it will be slow.
...
There are various ways one might go about reducing constant factors, but
they depend on the details of the problem. For example, if the dates are
dense and you are going to parse them anyway, you could replace the hash
with table that you index into with the date as an integer. I'm not sure
that you are going to do a lot better than the brute force algorithm in the
generic force case though.
Unfortunately it has to be something generic.
Thanks a lot for your help.
Geoffrey
_______________________________________________
Numpy-discussion mailing list
Numpy-discuss...@scipy.orghttp://projects.scipy.org/mailman/listinfo/numpy-discussion