[Tutor] fast sampling with replacement

Steven D'Aprano steve at pearwood.info
Sun Feb 21 18:34:19 CET 2010


On Sun, 21 Feb 2010 03:22:19 am Andrew Fithian wrote:
> Hi tutor,
>
> I have a statistical bootstrapping script that is bottlenecking on
> a python function sample_with_replacement(). I wrote this function
> myself because I couldn't find a similar function in python's random
> library. 

random.choice(list) does sample with replacement. If you need more than 
one sample, call it in a loop or a list comprehension:

>>> mylist = [1, 2, 4, 8, 16, 32, 64, 128]
>>> choice = random.choice
>>> values = [choice(mylist) for _ in xrange(4)]
>>> values
[64, 32, 4, 4]
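A side note: newer Pythons (3.6 and later, so after this post) grew
random.choices, which draws k samples with replacement in a single call:

```python
import random

mylist = [1, 2, 4, 8, 16, 32, 64, 128]
# random.choices samples with replacement, k values at a time,
# so no explicit loop or comprehension is needed.
values = random.choices(mylist, k=4)
print(values)
```

Since the sampling loop runs inside the library, this is usually faster
than calling random.choice once per sample.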



> This is the fastest version of the function I could come up 
> with (I used cProfile.run() to time every version I wrote) but it's
> not fast enough, can you help me speed it up even more?

Profiling doesn't tell you how *fast* something is, but how much time it 
is using. If something takes a long time to run, perhaps that's because 
you are calling it lots of times. If that's the case, you should try to 
find a way to call it less often. E.g. instead of taking a million 
samples, can you get by with only a thousand? Do you need to prepare 
all the samples ahead of time, or do them only when needed?
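The lazy approach can be sketched with a generator (the names here are
illustrative, not from your script; modern Python shown):

```python
import random

def bootstrap_samples(seq):
    # Yield bootstrap samples one at a time, only when asked for,
    # instead of building a big list of them up front.
    while True:
        yield [random.choice(seq) for _ in range(len(seq))]

gen = bootstrap_samples([1, 2, 4, 8])
first = next(gen)  # nothing is computed until you ask for it
```

If your script only ever consumes samples one at a time, this keeps
memory flat and spreads the sampling cost across the run.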



> import random
> def sample_with_replacement(list):
>     l = len(list) # the sample needs to be as long as list
>     r = xrange(l)
>     _random = random.random
>     return [list[int(_random()*l)] for i in r]
>     # using list[int(_random()*l)] is faster than random.choice(list)

Well, maybe... it's a near thing. I won't bother showing my timing code 
(hint: use the timeit module) unless you ask, but in my tests, I don't 
get a clear consistent winner between random.choice and your version. 
Your version is a little bit faster in about 2 out of 3 trials, but 
slower in 1 out of 3.
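A sketch of the kind of timing harness I mean, using timeit (modern
Python shown; the setup mirrors your function's hoisted locals):

```python
import random
import timeit

mylist = [1, 2, 4, 8, 16, 32, 64, 128]
n = len(mylist)
_random = random.random
choice = random.choice

# Time random.choice called in a comprehension...
t_choice = timeit.timeit(
    lambda: [choice(mylist) for _ in range(n)], number=10000)

# ...against the inlined index calculation.
t_inline = timeit.timeit(
    lambda: [mylist[int(_random() * n)] for _ in range(n)], number=10000)

print(t_choice, t_inline)
```

Run it a few times: the two figures land close together, which is the
point — the function-call overhead of choice() is the only real
difference.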

This isn't surprising if you look at the random.choice function:

    def choice(self, seq):
        return seq[int(self.random() * len(seq))]

It is essentially identical to your version, the major difference being 
you use the same algorithm directly in a list comprehension instead of 
calling it as a function. So I would expect any difference to be small.

I would say that you aren't going to speed up the sampling algorithm in 
pure Python. This leaves you with some options:

(1) Live with the speed as it is. Are you sure it's too slow?

(2) Re-write the random number generator in C.

(3) Find a way to use fewer samples.

(4) Use a lower-quality random number generator which is faster.
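For option (2), you don't necessarily have to write the C yourself. If
NumPy is available (an assumption — you don't mention it), its generator
produces all the indices in one call to compiled code, which is where
the real speedup lives:

```python
import numpy as np

def sample_with_replacement_np(seq):
    # Draw len(seq) random indices in one vectorized call (runs in C),
    # then index back into the original sequence.
    idx = np.random.randint(0, len(seq), size=len(seq))
    return [seq[i] for i in idx]

mylist = [1, 2, 4, 8, 16, 32, 64, 128]
print(sample_with_replacement_np(mylist))
```

For bootstrapping specifically, you could go further and keep the whole
resample as a NumPy array, avoiding the Python-level list entirely.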


> FWIW, my bootstrapping script is spending roughly half of the run
> time in sample_with_replacement() much more than any other function
> or method. Thanks in advance for any advice you can give me. 

You don't tell us whether that actually matters. Is it taking 3 hours in 
a 6 hour run time? If so, then it is worth spending time and effort to 
cut that down by a few hours. Or is it taking 0.2 seconds out of 0.4 
seconds? If that's the case, then who cares? :)



-- 
Steven D'Aprano

