finding most common elements between thousands of multiple arrays.
Steven D'Aprano
steve at REMOVE-THIS-cybersource.com.au
Sat Jul 4 21:39:33 EDT 2009
On Sat, 04 Jul 2009 07:19:48 -0700, Scott David Daniels wrote:
> Actually the next step is to maintain a min-heap as you run down the
> sorted array. Something like:
Not bad.
I did some tests on it, using the following sample data:
arr = np.array([xrange(i, i+7000) for i in xrange(143)] +
[[750]*7000] + [xrange(3*i, 3*i+7000) for i in xrange(142)])
and compared your code against the following simple function:
def count(arr, N):
D = {}
for v in arr:
for x in v:
D[x] = D.get(x, 0) + 1
freq = []
for el, f in D.iteritems():
freq.append((f, el))
return sorted(freq, reverse=True)[:N]
As a rough figure, your min-heap code is approximately twice as fast as
mine.
To the OP: I think you should start profiling your code and find out
exactly *where* it is slow and concentrate on that. I think that trying a
heuristic to estimate the most frequent elements by taking a random
sample is likely to be a mistake -- whatever you're trying to accomplish
with the frequency counts, the use of such a heuristic will mean that
you're only approximately accomplishing it.
--
Steven
More information about the Python-list
mailing list