For loop searching takes too long!

Jonathan Gardner jgardner at jonathangardner.net
Fri Jan 29 17:49:06 EST 2010


On Jan 28, 3:52 pm, elsa <kerensael... at hotmail.com> wrote:
>
> I've got a problem with my program, in that the code just takes too
> long to run. Here's what I'm doing. If anyone has any tips, they'd be
> much appreciated!
>

First of all, don't play with large lists. Large lists have a tendency
to grow larger over time, until they get absurdly large.

If you're dealing with a long list, work with it like you'd work with
a file. If you need random access, stick it in some form of database,
such as BDB or PostgreSQL. If you need an index, stick it in some form
of DB. Eventually, large lists end up like that anyway. Don't fool
yourself early on---prepare for the worst and solve the problems of
tomorrow today.

> So, say I have a list of lists that looks something like this (I'm
> using a list of lists, rather than a list of tuples, as I need it to
> be mutable):
>
> myList = [[786,0],[45, 1],[673,1],...................[23,46]]
>
> there are enough entries in the outer list, that the sum of myList[i]
> [0] across all i could be as high as 10^7.
>
> Now, what I need to do is randomly choose one myList[i], however the
> distribution of my random choice needs to be proportional to the
> values of myList[i][0]. So, for this list above, I'd have a much
> higher chance of choosing myList[0] than myList[1].
>

Let's do some thinking here.

Let's say you have a list of N items. You need to choose one, but you
don't know how many there are.

One algorithm that works is you start with the first item. If that's
the only item, you've chosen it.

If there's another item, you roll the dice. There's a 1/2 chance you
drop the first item and take the sedon.

If there's a third item, you roll the dice, There's a 1/3 chance you
drop the first or second item, and take the third.

You see the pattern? Keep the nth item 1/n of the time.

In your case, you're actually pulling off N items each time, all with
the same value. Your chance of keeping the next item is (number of
items in the next one)/(total items seen so far, including the next
one). Do the math for this. It's really simple.

Let's walk through this:

[[786,0],[45, 1],[673,2],...................[23,46]]

Step 1: See [786,0]. Remember 786. Keep 0.

Step 2: See [45,1].  Remember 786+45=831. Keep 1 45/831 of the time.

Step 3: See [673,2]. Remember 831+673=1504. Keep 2 673/1504 of the
time.

Etc...

Now, the algorithm I've just described is much less memory intensive
and can deal with very long lists of numbers. However, you are calling
rand() a lot. rand() is not a cheap operation. One optimization is to
roll rand() once, and keep using that number over and over again.
Granted, for very large numbers of items, you're going to need a very
precise random number of some very small probabilities will never be
chosen. This is left as an exercise for the reader.



More information about the Python-list mailing list