For loop searching takes too long!
Steve Holden
steve at holdenweb.com
Thu Jan 28 22:02:01 EST 2010
Steve Holden wrote:
> Dave Angel wrote:
>> elsa wrote:
>>> Hi guys,
>>>
>>> I've got a problem with my program, in that the code just takes too
>>> long to run. Here's what I'm doing. If anyone has any tips, they'd be
>>> much appreciated!
>>>
>>> So, say I have a list of lists that looks something like this (I'm
>>> using a list of lists, rather than a list of tuples, as I need it to
>>> be mutable):
>>>
>>> myList = [[786,0],[45, 1],[673,1],...................[23,46]]
>>>
>>> there are enough entries in the outer list, that the sum of myList[i]
>>> [0] across all i could be as high as 10^7.
>>>
>>> Now, what I need to do is randomly choose one myList[i], however the
>>> distribution of my random choice needs to be proportional to the
>>> values of myList[i][0]. So, for this list above, I'd have a much
>>> higher chance of choosing myList[0] than myList[1].
>>>
>>> Here is how I'm doing it at the moment:
>>>
>>> def chooseI(myList):
>>> mySum=0
>>> choice = random.choice(range(1,sum([i[0] for i in myList])+1))
>>> for i in range(len(myList)):
>>> mySum+=myList[i][0]
>>> if mySum>=choice:
>>> return i
>>> break
>>>
>>> This works just fine if sum([i[0] for i in myList]) is less than
>>> 10,000, say. However if its up around 10^7, the whole thing crashes.
>>> Is there a faster way of doing this, that doesn't involve as many
>>> computational steps?
>>>
>>> Thanks!
>>>
>>> elsa
>>>
>>>
>> "the whole thing crashes" -- give us the specific crash you see,
>> including the stack trace. "Crash" is far too vague.
>>
>> At first glance you could be running out of memory. But not with the
>> values you're quoting. If a typical average value for i[0] is 50,
>> that's only 200k elements. Anyway, you could print out the len(myList)
>> to see how big the list is.
>>
>> There are quicker ways to do the sum, but the first thing to change is
>> the random.choice() stuff. Once you have a sum, you can use
>> random.randint() on it directly. No need to make a range list.
>>
>> Another approach might be to build a new list with a running sum in it.
>> Then after doing the randint() on the last (highest) item, you can do a
>> bisect.bisect on that list. The index that returns will be your return
>> value.
>>
> Your approach seems to validate an informal impression I had that with
> N choices one ought to be able to build a binary tree where each
> decision went to left or right according to whether a number drawn from
> a linear [0,1) distribution was greater that some arbitrary margin, or not.
>
> There are then arithmetical questions of exactly how to arrive at the
> correct threshold values for each bifurcation. If the most probable
> paths are deliberately placed at the top of the tree then the most
> frequent cases will be fastest to generate, also.
>
Bad form, I know, to reply to oneself, but since I was egotistical
enough to read what I wrote again when it was published I must also
record the conjecture that the resulting alphabet, expressed as the
binary history of the bifurcations chosen for any given symbol, must
surely be a Hamming code to be most efficient.
No proof is adduced.
regards
Steve
--
Steve Holden +1 571 484 6266 +1 800 494 3119
PyCon is coming! Atlanta, Feb 2010 http://us.pycon.org/
Holden Web LLC http://www.holdenweb.com/
UPCOMING EVENTS: http://holdenweb.eventbrite.com/
More information about the Python-list
mailing list