random.sample with large weighted sample-sets?
duncan smith
buzzard at invalid.invalid
Sun Feb 16 18:38:38 CET 2014
On 16/02/14 16:35, Charles Allen wrote:
> How efficient does this thing need to be?
>
> You can always just turn it into a two-dimensional sampling problem by
> thinking of the data as a function f(x=item), generating a random x=xr
> in [0,x], then generating a random y in [0,max(f(x))]. The xr is
> accepted if 0 < y <= max(f(xr)), or rejected (and another attempt made) if
> y > max(f(xr)).
>
You can avoid rejection by constructing an alias table. A list can be
constructed such that each list element contains a pair of values and a
cutoff. e.g.
[("apple", 20), ("orange", 50), ("grape", 30)]
would become (using one particular algorithm)
[(("apple", "orange"), 0.6),
(("orange", "apple"), 1.0),
(("grape", "orange"), 0.9)]
Generate a random index, then select one of the values on the basis of
the cutoff. For short enough lists you can generate a single 0-1 random
variate, u, and use int(n*u) for the index and compare n*u - int(n*u) to
the cutoff, where n is the length of the list. It's still sampling with
replacement though.
Duncan
More information about the Python-list
mailing list