Multiple disjoint sample sets?
Peter Otten
__peter__ at web.de
Sun Jan 13 11:16:03 CET 2013
Roy Smith wrote:
> I have a list of items. I need to generate n samples of k unique items
> each. I not only want each sample set to have no repeats, but I also
> want to make sure the sets are disjoint (i.e. no item repeated between
> sets).
>
> random.sample(items, k) will satisfy the first constraint, but not the
> second. Should I just do random.sample(items, k*n), and then split the
> resulting big list into n pieces? Or is there some more efficient way?
>
> Typical values:
>
> len(items) = 5,000,000
> n = 10
> k = 100,000
I would expect that your simple approach is more efficient than shuffling
the whole list.
Assuming there is a sample_iter(population) that generates unique items from
the population (which has no repetitions itself) you can create the samples
with
g = sample_iter(items)
samples = [list(itertools.islice(g, k) for _ in xrange(n)]
My ideas for such a sample_iter():
def sample_iter_mark(items):
n = len(items)
while True:
i = int(random()*n)
v = items[i]
if v is not None:
yield v
items[i] = None
This is destructive and will degrade badly as the number of None items
increases. For your typical values it seems to be OK though. You can make
this non-destructive by adding a bit array or a set (random.Random.sample()
has code that uses a set) to keep track of the seen items.
Another sample_iter() (which is also part of the random.Random.sample()
implementation):
def sample_iter_replace(items):
n = len(items)
for k in xrange(n):
i = int(random()*(n-k))
yield items[i]
items[i] = items[n-k-1]
You can micro-optimise that a bit to avoid the index calculation. Also,
instead of overwriting items you could swap them, so that no values would be
lost, only their initial order.
More information about the Python-list
mailing list