finding most common elements between thousands of multiple arrays.

Scott David Daniels Scott.Daniels at Acm.Org
Sat Jul 4 10:19:48 EDT 2009


Vilya Harvey wrote:
> 2009/7/4 Andre Engels <andreengels at gmail.com>:
>> On Sat, Jul 4, 2009 at 9:33 AM, mclovin<hanooter at gmail.com> wrote:
>>> Currently I need to find the most common elements in thousands of
>>> arrays within one large array (arround 2 million instances with ~70k
>>> unique elements)...
> Try flattening the arrays into a single large array & sorting it. Then
> you can just iterate over the large array counting as you go; you only
> ever have to insert into the dict once for each value and there's no
> lookups in the dict....

Actually the next step is to maintain a min-heap as you run down the
sorted array.  Something like:

import numpy as np
import heapq


def frequency(arr):
     '''Generate frequency-value pairs from a numpy array'''
     clustered = arr.flatten() # copy (so can safely sort)
     clustered.sort() # Bring identical values together
     scanner = iter(clustered)
     last = scanner.next()
     count = 1
     for el in scanner:
         if el == last:
             count += 1
         else:
             yield count, last
             last = el
             count = 1
     yield count, last


def most_frequent(arr, N):
     '''Return the top N (freq, val) elements in arr'''
     counted = frequency(arr) # get an iterator for freq-val pairs
     heap = []
     # First, just fill up the array with the first N distinct
     for i in range(N):
         try:
             heap.append(counted.next())
         except StopIteration:
             break # If we run out here, no need for a heap
     else:
         # more to go, switch to a min-heap, and replace the least
         # element every time we find something better
         heapq.heapify(heap)
         for pair in counted:
             if pair > heap[0]:
                 heapq.heapreplace(heap, pair)
     return sorted(heap, reverse=True) # put most frequent first.


--Scott David Daniels
Scott.Daniels at Acm.Org



More information about the Python-list mailing list