finding most common elements between thousands of multiple arrays.

mclovin hanooter at gmail.com
Sun Jul 5 00:06:29 CEST 2009


On Jul 4, 12:51 pm, Scott David Daniels <Scott.Dani... at Acm.Org> wrote:
> mclovin wrote:
> > OK then. I will try some of the strategies here but I guess things
> > arent looking too good. I need to run this over a dataset that someone
> > pickled. I need to run this 480,000 times so you can see my
> > frustration. So it doesn't need to be "real time" but it would be nice
> > it was done sorting this month.
>
> > Is there a "bet guess" strategy where it is not 100% accurate but much
> > faster?
>
> Well, I timed a run of a version of mine, and the scan is approx 5X
> longer than the copy-and-sort.  Time arr_of_arr.flatten().sort() to
> see how quickly the copy and sort happens.So you could try a variant
> exploiting the following property:
>      If you know the minimum length of a run that will be in the top 25,
> then the value for each of the most-frequent run entries must show up at
> positions n * stride and (n + 1) * stride (for some n).  That should
> drastically reduce the scan cost, as long as stride is reasonably large.
>
> For my uniformly distributed 0..1024 values in 5M x 5M array,
> About 2.5 sec to flatten and sort.
> About 15 sec to run one of my heapish thingies.
> the least frequency encountered: 24716
> so, with stride at
>
> sum(flattened[:-stride:stride] == flattened[stride::stride]) == 1000
> So there are only 1000 points to investigate.
> With any distribution other than uniform, that should go _way_ down.
> So just pull out those points, use bisect to get their frequencies, and
> feed those results into the heap accumulation.
>
> --Scott David Daniels

I dont quite understand what you are saying but I know this: the times
the most common element appears varies greatly. Sometimes it appears
over 1000 times, and some times it appears less than 50. It all
depends on the density of the arrays I am analyzing.

like I said I need to do this 480,000 times so to get this done
realistically I need to analyse about 5 a second. It appears that the
average matrix size contains about 15 million elements.

I threaded my program using your code and I did about 1,000 in an hour
so it is still much too slow.

When I selected 1 million random elements to count, 8 out of the top
10 of those were in the top 25 of the precise way and 18 of the 25
were in the top 25 of the precise way. so I suppose that could be an
option.



More information about the Python-list mailing list