finding most common elements between thousands of multiple arrays.

Andre Engels andreengels at gmail.com
Sat Jul 4 10:04:42 CEST 2009


On Sat, Jul 4, 2009 at 9:33 AM, mclovin<hanooter at gmail.com> wrote:
> Currently I need to find the most common elements in thousands of
> arrays within one large array (arround 2 million instances with ~70k
> unique elements)
>
> so I set up a dictionary to handle the counting so when I am
> iterating  I up the count on the corrosponding dictionary element. I
> then iterate through the dictionary and find the 25 most common
> elements.
>
> the elements are initially held in a array within an array. so I am am
> just trying to find the most common elements between all the arrays
> contained in one large array.
> my current code looks something like this:
> d = {}
> for arr in my_array:
> -----for i in arr:
> #elements are numpy integers and thus are not accepted as dictionary
> keys
> -----------d[int(i)]=d.get(int(i),0)+1
>
> then I filter things down. but with my algorithm that only takes about
> 1 sec so I dont need to show it here since that isnt the problem.
>
>
> But there has to be something better. I have to do this many many
> times and it seems silly to iterate through 2 million things just to
> get 25. The element IDs are integers and are currently being held in
> numpy arrays in a larger array. this ID is what makes up the key to
> the dictionary.
>
>  It currently takes about 5 seconds to accomplish this with my current
> algorithm.
>
> So does anyone know the best solution or algorithm? I think the trick
> lies in matrix intersections but I do not know.

There's no better algorithm for the general case. No method of
checking the matrices using less than 2000000-x look-ups will ensure
you that there's not a new value with x occurences lurking somewhere.

However, if you need this information more often, and if the number of
values that changes in between is small compared to the total size of
the matrices, you could make a gain in subsequent calculations by
remembering part results. Depending on the situation you could for
example keep the dictionary with the counts and update it each time,
or keep such a dictionary for each "big" matrix, and set a flag when
that dictionary is not up-to-date any more

.

-- 
André Engels, andreengels at gmail.com



More information about the Python-list mailing list