Generating these types of summary statistics is very common in SAS. In
SAS you would set up a sequence of procedures. First sort by the
variables of interest and then calculate the metrics of interest by
the combination of values. In numpy/scipy this might be something
like:
1. Sort by date and region
2. Determine 1st and last index of the blocks
3. Calculate mean,sum, etc. for each of the blocks.
Based on the earlier arguments I am wondering, however, if this would
provide any speed up. I am very interested in this issue so if you
implement some general procedures and perhaps speed tests please share
them with the list.
Best,
Vincent
On Jul 31, 10:00 am, "Geoffrey Zhu"
Hi Timothy,
On 7/31/07, Timothy Hochberg
wrote: [SNIP]
The 'brute-force' way is basically what you suggested -- looping through all the records and building a two-way hash-table of the data.
The problem of the brute-force' approach is that it is not taking advantage of facilities of numpy and can be slow in speed. If only there is some built-in mechanism in numpy to handle this.
I'm curious; have you tried this and found it slow, or is this a hunch based on the reputed slowness of Python? Algorithmically, you can't do much better: the dictionary and set operations are O(1), so the whole operation is O(N), and you won't do any better than that, order wise. What your left with is trying to reduce constant factors.
I agree that algorithmically you can't do much better. It is basically a C vs Python thing. One advantage of numpy is that you can do vectorized operations at the speed of C. With this algorithm, we have to process the data element by element and the speed advantage of numpy is lost. Since data has to be stored in python sets and maps, I imagine the storage advantage is also lost.
I was in fact looking for some implemention of this algorithm in numpy (and so C) that does exactly this, or some implementation of this algorithm that can leverage the fast numpy routines to do this.
I haven't tried it with the real data load yet. I know the number of records will be huge and it is just a hunch that it will be slow.
There are various ways one might go about reducing constant factors, but they depend on the details of the problem. For example, if the dates are dense and you are going to parse them anyway, you could replace the hash with table that you index into with the date as an integer. I'm not sure that you are going to do a lot better than the brute force algorithm in the generic force case though.
Unfortunately it has to be something generic.
Thanks a lot for your help. Geoffrey _______________________________________________ Numpy-discussion mailing list Numpy-discuss...@scipy.orghttp://projects.scipy.org/mailman/listinfo/numpy-discussion