finding most common elements between thousands of multiple arrays.
Steven D'Aprano
steve at REMOVE-THIS-cybersource.com.au
Sat Jul 4 09:42:06 EDT 2009
On Sat, 04 Jul 2009 10:55:44 +0100, Vilya Harvey wrote:
> 2009/7/4 Andre Engels <andreengels at gmail.com>:
>> On Sat, Jul 4, 2009 at 9:33 AM, mclovin<hanooter at gmail.com> wrote:
>>> Currently I need to find the most common elements in thousands of
>>> arrays within one large array (around 2 million instances with ~70k
>>> unique elements)
...
>> There's no better algorithm for the general case. No method of checking
>> the matrices using fewer than 2000000-x look-ups can guarantee that
>> there's not a new value with x occurrences lurking somewhere.
>
> Try flattening the arrays into a single large array & sorting it. Then
> you can just iterate over the large array counting as you go; you only
> ever have to insert into the dict once for each value and there are no
> lookups in the dict.
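In rough code, that suggestion amounts to something like the sketch below
(the function name, the n parameter and the plain list-of-lists input are
illustrative assumptions, not anything from the original post):

from itertools import groupby

def most_common_by_sorting(arrays, n=10):
    # flatten all ~2,000,000 elements and sort them, O(N log N)
    flat = sorted(x for arr in arrays for x in arr)
    # one pass over the sorted data: each unique value is counted once,
    # as a run of equal neighbours, with no per-element dict lookups
    counts = [(key, sum(1 for _ in grp)) for key, grp in groupby(flat)]
    counts.sort(key=lambda kv: kv[1], reverse=True)
    return counts[:n]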
You're suggesting doing a whole bunch of work copying 2,000,000 pointers
into a single array, then a whole bunch more work sorting that second
array (which is O(N*log N) on average), and then finally iterating over
the second array. Sure, that last pass only needs one dict insertion per
unique value, but to get to that point you've already done more work than
just iterating over the array-of-arrays in the first place.
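For comparison, the plain single pass over the array-of-arrays is roughly
this (again only a sketch; a defaultdict is one obvious way to hold the
tallies):

from collections import defaultdict

def most_common_by_counting(arrays, n=10):
    counts = defaultdict(int)
    for arr in arrays:              # one pass over the array-of-arrays
        for x in arr:
            counts[x] += 1          # one dict lookup-and-store per element
    return sorted(counts.items(), key=lambda kv: kv[1], reverse=True)[:n]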
Now, if you're really lucky, your strategy can be done in fast C code
instead of slow Python code, and you might see a speed-up for values of N
which aren't too big. But I wouldn't put money on it.
Another strategy might be to pre-count the elements in each array as you
build or modify them. This will marginally slow down each modification
you make to the arrays, but the payoff is that finding the frequency of
any element becomes almost instantaneous.
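A minimal sketch of that idea, with made-up class and method names, might
look like:

from collections import defaultdict

class CountedArrays(object):
    """Keep a running tally of element frequencies as arrays are added."""

    def __init__(self):
        self.arrays = []
        self.counts = defaultdict(int)

    def append(self, arr):
        # the marginal extra cost is paid here, at modification time
        self.arrays.append(list(arr))
        for x in arr:
            self.counts[x] += 1

    def most_common(self, n=10):
        # frequencies are already known, so this is nearly instantaneous
        return sorted(self.counts.items(),
                      key=lambda kv: kv[1], reverse=True)[:n]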
--
Steven