Hi Timothy, On 7/31/07, Timothy Hochberg <tim.hochberg@ieee.org> wrote:
[SNIP]
The 'brute-force' way is basically what you suggested -- looping through all the records and building a two-way hash-table of the data.
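For concreteness, a minimal sketch of that brute-force two-way hash-table approach might look like the following (the record values here are made-up examples, not your actual data):

```python
# Hypothetical example records: (key, date) pairs.
records = [("a", "2007-01-01"), ("b", "2007-01-02"), ("a", "2007-01-02")]

key_to_dates = {}  # forward map: key -> set of dates it appears with
date_to_keys = {}  # reverse map: date -> set of keys it appears with
for key, date in records:
    key_to_dates.setdefault(key, set()).add(date)
    date_to_keys.setdefault(date, set()).add(key)
```

One pass over the records, constant-time dict/set operations per record.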
The problem with the 'brute-force' approach is that it doesn't take advantage of numpy's facilities and may be slow. If only there were some built-in mechanism in numpy to handle this.
I'm curious; have you tried this and found it slow, or is this a hunch based on the reputed slowness of Python? Algorithmically, you can't do much better: the dictionary and set operations are O(1), so the whole operation is O(N), and you won't do any better than that, order-wise. What you're left with is trying to reduce the constant factors.
I agree that algorithmically you can't do much better. It is basically a C vs Python thing. One advantage of numpy is that you can do vectorized operations at the speed of C. With this algorithm, we have to process the data element by element, and the speed advantage of numpy is lost. Since the data has to be stored in Python sets and dicts, I imagine the storage advantage is also lost. I was in fact looking for an implementation of this algorithm in numpy (and so in C), or one that can leverage the fast numpy routines. I haven't tried it with the real data load yet. I know the number of records will be huge, and it is just a hunch that it will be slow.
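One way to keep the per-element work in C, as a hedged sketch: factorize one column with np.unique and group the record indices by key using argsort/searchsorted instead of a Python loop. The sample data here is an assumption for illustration:

```python
import numpy as np

# Hypothetical column of keys from the record array.
keys = np.array(["a", "b", "a", "c", "b"])

# Map each element to an integer code in C, not in a Python loop.
uniq, inverse = np.unique(keys, return_inverse=True)

# Stable argsort clusters equal codes; searchsorted finds each
# group's start, and split yields the record indices per unique key.
order = np.argsort(inverse, kind="stable")
starts = np.searchsorted(inverse[order], np.arange(len(uniq)))
groups = np.split(order, starts[1:])
```

Building the reverse mapping is the same trick applied to the other column; the element-by-element cost moves from Python into numpy's C routines.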
There are various ways one might go about reducing constant factors, but they depend on the details of the problem. For example, if the dates are dense and you are going to parse them anyway, you could replace the hash with a table that you index into with the date as an integer. I'm not sure that you are going to do a lot better than the brute-force algorithm in the generic case, though.
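That dense-date idea might be sketched like this; the epoch, the one-year span, and the sample values are all assumptions for illustration:

```python
from datetime import date

# Assumed: all dates fall within one known year, so days-since-epoch
# is a small integer usable as a plain list index.
epoch = date(2007, 1, 1)
n_days = 365  # assumed span of the data

table = [None] * n_days
for d, value in [(date(2007, 1, 5), "x"), (date(2007, 3, 1), "y")]:
    table[(d - epoch).days] = value  # direct indexing, no hashing

# O(1) lookup with no hash computation:
hit = table[(date(2007, 1, 5) - epoch).days]
```

The table trades a little memory (slots for dates that never occur) for cheaper lookups than a dict.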
Unfortunately it has to be something generic. Thanks a lot for your help. Geoffrey