[Tutor] fast sampling with replacement
Steven D'Aprano
steve at pearwood.info
Sun Feb 21 18:34:19 CET 2010
On Sun, 21 Feb 2010 03:22:19 am Andrew Fithian wrote:
> Hi tutor,
>
> I'm have a statistical bootstrapping script that is bottlenecking on
> a python function sample_with_replacement(). I wrote this function
> myself because I couldn't find a similar function in python's random
> library.
random.choice(list) does sample with replacement. If you need more than
one sample, call it in a loop or a list comprehension:
>>> mylist = [1, 2, 4, 8, 16, 32, 64, 128]
>>> choice = random.choice
>>> values = [choice(mylist) for _ in xrange(4)]
>>> values
[64, 32, 4, 4]
> This is the fastest version of the function I could come up
> with (I used cProfile.run() to time every version I wrote) but it's
> not fast enough, can you help me speed it up even more?
Profiling doesn't tell you how *fast* something is, but how much time it
is using. If something takes a long time to run, perhaps that's because
you are calling it lots of times. If that's the case, you should try to
find a way to call it less often. E.g. instead of taking a million
samples, can you get by with only a thousand? Do you need to prepare
all the samples ahead of time, or do them only when needed?
> import random
> def sample_with_replacement(list):
> l = len(list) # the sample needs to be as long as list
> r = xrange(l)
> _random = random.random
> return [list[int(_random()*l)] for i in r] # using
> list[int(_random()*l)] is faster than random.choice(list)
Well, maybe... it's a near thing. I won't bother showing my timing code
(hint: use the timeit module) unless you ask, but in my tests, I don't
get a clear consistent winner between random.choice and your version.
Your version is a little bit faster about 2 out of 3 trials, but slower
1 out of 3.
This isn't surprising if you look at the random.choice function:
def choice(self, seq):
return seq[int(self.random() * len(seq))]
It is essentially identical to your version, the major difference being
you use the same algorithm directly in a list comprehension instead of
calling it as a function. So I would expect any difference to be small.
I would say that you aren't going to speed up the sampling algorithm in
pure Python. This leaves you with some options:
(1) Live with the speed as it is. Are you sure it's too slow?
(2) Re-write the random number generator in C.
(3) Find a way to use fewer samples.
(4) Use a lower-quality random number generator which is faster.
> FWIW, my bootstrapping script is spending roughly half of the run
> time in sample_with_replacement() much more than any other function
> or method. Thanks in advance for any advice you can give me.
You don't tell us whether that actually matters. Is it taking 3 hours in
a 6 hour run time? If so, then it is worth spending time and effort to
cut that down by a few hours. Or is it taking 0.2 seconds out of 0.4
seconds? If that's the case, then who cares? :)
--
Steven D'Aprano
More information about the Tutor
mailing list