[issue41311] Add a function to get a random sample from an iterable (reservoir sampling)

Oscar Benjamin report at bugs.python.org
Fri Jul 17 20:21:04 EDT 2020


Oscar Benjamin <oscar.j.benjamin at gmail.com> added the comment:

> At its heart, this a CPython optimization to take advantage of list() being slower than a handful of islice() calls.

This comment suggest that you have missed the general motivation for reservoir sampling. Of course the stdlib can not satisfy all use cases so this can be out of scope as a feature. It is not the case though that this is a CPython optimisation.

The idea of reservoir sampling is that you want to sample from an iterator, you only get one chance to iterate over it, and you don't know a priori how many items it will yield. The classic example of that situation is reading from a text file but in general it maps neatly onto Python's concept of iterators. The motivation for generators/iterators in Python is that there are many situations where it is better to avoid building a concrete in-memory data structure and it can be possible to avoid doing so with appropriately modified algorithms (such as this one). 

The core use case for this feature is not sampling from an in-memory data structure but rather sampling from an expensive generator or an iterator over a file/database. The premise is that it is undesirable or perhaps impossible to build a list out of the items of the iterable. In those contexts the comparative properties of sample/choices are to some extent irrelevant because those APIs can not be used or should be avoided because of their overhead in terms of memory or other resources.

----------

_______________________________________
Python tracker <report at bugs.python.org>
<https://bugs.python.org/issue41311>
_______________________________________


More information about the Python-bugs-list mailing list