# finding most common elements between thousands of multiple arrays.

Scott David Daniels Scott.Daniels at Acm.Org
Sat Jul 4 16:19:48 CEST 2009

Vilya Harvey wrote:
> 2009/7/4 Andre Engels <andreengels at gmail.com>:
>> On Sat, Jul 4, 2009 at 9:33 AM, mclovin<hanooter at gmail.com> wrote:
>>> Currently I need to find the most common elements in thousands of
>>> arrays within one large array (arround 2 million instances with ~70k
>>> unique elements)...
> Try flattening the arrays into a single large array & sorting it. Then
> you can just iterate over the large array counting as you go; you only
> ever have to insert into the dict once for each value and there's no
> lookups in the dict....

Actually the next step is to maintain a min-heap as you run down the
sorted array.  Something like:

import numpy as np
import heapq

def frequency(arr):
'''Generate frequency-value pairs from a numpy array'''
clustered = arr.flatten() # copy (so can safely sort)
clustered.sort() # Bring identical values together
scanner = iter(clustered)
last = scanner.next()
count = 1
for el in scanner:
if el == last:
count += 1
else:
yield count, last
last = el
count = 1
yield count, last

def most_frequent(arr, N):
'''Return the top N (freq, val) elements in arr'''
counted = frequency(arr) # get an iterator for freq-val pairs
heap = []
# First, just fill up the array with the first N distinct
for i in range(N):
try:
heap.append(counted.next())
except StopIteration:
break # If we run out here, no need for a heap
else:
# more to go, switch to a min-heap, and replace the least
# element every time we find something better
heapq.heapify(heap)
for pair in counted:
if pair > heap[0]:
heapq.heapreplace(heap, pair)
return sorted(heap, reverse=True) # put most frequent first.

--Scott David Daniels
Scott.Daniels at Acm.Org