finding most common elements between thousands of multiple arrays.

Sat Jul 4 03:45:32 EDT 2009

On Sat, Jul 4, 2009 at 12:33 AM, mclovin<hanooter at gmail.com> wrote:
> Currently I need to find the most common elements in thousands of
> arrays within one large array (arround 2 million instances with ~70k
> unique elements)
>
> so I set up a dictionary to handle the counting so when I am
> iterating  I up the count on the corrosponding dictionary element. I
> then iterate through the dictionary and find the 25 most common
> elements.
>
> the elements are initially held in a array within an array. so I am am
> just trying to find the most common elements between all the arrays
> contained in one large array.
> my current code looks something like this:
> d = {}
> for arr in my_array:
> -----for i in arr:
> #elements are numpy integers and thus are not accepted as dictionary
> keys
> -----------d[int(i)]=d.get(int(i),0)+1
>
> then I filter things down. but with my algorithm that only takes about
> 1 sec so I dont need to show it here since that isnt the problem.
>
>
> But there has to be something better. I have to do this many many
> times and it seems silly to iterate through 2 million things just to
> get 25. The element IDs are integers and are currently being held in
> numpy arrays in a larger array. this ID is what makes up the key to
> the dictionary.
>
>  It currently takes about 5 seconds to accomplish this with my current
> algorithm.
>
> So does anyone know the best solution or algorithm? I think the trick
> lies in matrix intersections but I do not know.

Using a defaultdict
(http://docs.python.org/library/collections.html#collections.defaultdict)
would speed it up (but only slightly) and make it clearer to read.

Cheers,
Chris
-- 
http://blog.rebertia.com