finding most common elements between thousands of multiple arrays.

Steven D'Aprano steve at REMOVE-THIS-cybersource.com.au
Sat Jul 4 10:38:52 EDT 2009


On Sat, 04 Jul 2009 13:42:06 +0000, Steven D'Aprano wrote:

> On Sat, 04 Jul 2009 10:55:44 +0100, Vilya Harvey wrote:
> 
>> 2009/7/4 Andre Engels <andreengels at gmail.com>:
>>> On Sat, Jul 4, 2009 at 9:33 AM, mclovin<hanooter at gmail.com> wrote:
>>>> Currently I need to find the most common elements in thousands of
>>>> arrays within one large array (arround 2 million instances with ~70k
>>>> unique elements)
> ...
>>> There's no better algorithm for the general case. No method of
>>> checking the matrices using less than 2000000-x look-ups will ensure
>>> you that there's not a new value with x occurences lurking somewhere.
>> 
>> Try flattening the arrays into a single large array & sorting it. Then
>> you can just iterate over the large array counting as you go; you only
>> ever have to insert into the dict once for each value and there's no
>> lookups in the dict.
> 
> You're suggesting to do a whole bunch of work copying 2,000,000 pointers
> into a single array, then a whole bunch of more work sorting that second
> array (which is O(N*log N) on average), and then finally iterate over
> the second array. Sure, that last step will on average involve fewer
> than O(N) steps, 

Er what?

Ignore that last comment -- I don't know what I was thinking. You still 
have to iterate over all N elements, sorted or not.

> but to get to that point you've already done more work
> than just iterating over the array-of-arrays in the first place.

What it does buy you though, as you pointed out, is reducing the number 
of explicit dict lookups and writes. However, dict lookups and writes are 
very fast, fast enough that they're used throughout Python. A line like:

count += 1

actually is a dict lookup and write.



-- 
Steven



More information about the Python-list mailing list