finding most common elements between thousands of multiple arrays.
hanooter at gmail.com
Sat Jul 4 12:22:48 EDT 2009
OK then. I will try some of the strategies here but I guess things
arent looking too good. I need to run this over a dataset that someone
pickled. I need to run this 480,000 times so you can see my
frustration. So it doesn't need to be "real time" but it would be nice
it was done sorting this month.
Is there a "bet guess" strategy where it is not 100% accurate but much
On Jul 4, 7:38 am, Steven D'Aprano <st... at REMOVE-THIS-
> On Sat, 04 Jul 2009 13:42:06 +0000, Steven D'Aprano wrote:
> > On Sat, 04 Jul 2009 10:55:44 +0100, Vilya Harvey wrote:
> >> 2009/7/4 Andre Engels <andreeng... at gmail.com>:
> >>> On Sat, Jul 4, 2009 at 9:33 AM, mclovin<hanoo... at gmail.com> wrote:
> >>>> Currently I need to find the most common elements in thousands of
> >>>> arrays within one large array (arround 2 million instances with ~70k
> >>>> unique elements)
> > ...
> >>> There's no better algorithm for the general case. No method of
> >>> checking the matrices using less than 2000000-x look-ups will ensure
> >>> you that there's not a new value with x occurences lurking somewhere.
> >> Try flattening the arrays into a single large array & sorting it. Then
> >> you can just iterate over the large array counting as you go; you only
> >> ever have to insert into the dict once for each value and there's no
> >> lookups in the dict.
> > You're suggesting to do a whole bunch of work copying 2,000,000 pointers
> > into a single array, then a whole bunch of more work sorting that second
> > array (which is O(N*log N) on average), and then finally iterate over
> > the second array. Sure, that last step will on average involve fewer
> > than O(N) steps,
> Er what?
> Ignore that last comment -- I don't know what I was thinking. You still
> have to iterate over all N elements, sorted or not.
> > but to get to that point you've already done more work
> > than just iterating over the array-of-arrays in the first place.
> What it does buy you though, as you pointed out, is reducing the number
> of explicit dict lookups and writes. However, dict lookups and writes are
> very fast, fast enough that they're used throughout Python. A line like:
> count += 1
> actually is a dict lookup and write.
More information about the Python-list