sampling from frequency distribution / histogram without replacement
Spencer Graves
spencer.graves at effectivedefense.org
Mon Jan 14 21:41:05 EST 2019
On 2019-01-14 18:40, duncan smith wrote:
> On 14/01/2019 22:59, Gregory Ewing wrote:
>> duncan smith wrote:
>>> Hello,
>>> Just checking to see if anyone has attacked this problem before
>>> for cases where the population size is unfeasibly large.
>> The fastest way I know of is to create a list of cumulative
>> frequencies, then generate uniformly distributed numbers and
>> use a binary search to find where they fall in the list.
>> That's O(log n) per sample in the size of the list once it's
>> been set up.
>>
> That's the sort of thing I've been thinking about. But once I'd found
> the relevant category I'd need to reduce its frequency by 1 and
> correspondingly update the cumulative frequencies. Alternatively, I
> could add an extra step where I selected a unit from the relevant
> category with probability equal to the proportion of non-sampled units
> from the category. I could maybe set up an alias table and do something
> similar.
>
> The other thing I was thinking about was iterating through the
> categories (ideally from largest frequency to smallest frequency),
> generating the numbers to be sampled from the current category and the
> remaining categories (using numpy.random.hypergeometric). With a few
> large frequencies and lots of small frequencies that could be quite
> quick (on average). Alternatively I could partition the categories into
> two sets, generate the number to be sampled from each partition, then
> partition the partitions etc. binary search style.
>
> I suppose I'll try the both the alias table + rejection step and the
> recursive partitioning approach and see how they turn out. Cheers.
R has functions "sample" and "sample.int"; see
"https://www.rdocumentation.org/packages/base/versions/3.5.2/topics/sample".
You can call R from Python,
"https://sites.google.com/site/aslugsguidetopython/data-analysis/pandas/calling-r-from-python".
These are in the "base" package. I believe they have been an
important part of the base R language almost since its inception and
have been used extensively. You'd have to work really hard to do
better, in my judgment.
Spencer Graves
DISCLAIMER: I'm primarily an R guy and only use Python when I can't
find a sensible way to do what I want in R.
>
> Duncan
More information about the Python-list
mailing list