sampling from frequency distribution / histogram without replacement

Mon Jan 14 21:41:05 EST 2019

On 2019-01-14 18:40, duncan smith wrote:
> On 14/01/2019 22:59, Gregory Ewing wrote:
>> duncan smith wrote:
>>> Hello,
>>>        Just checking to see if anyone has attacked this problem before
>>> for cases where the population size is unfeasibly large.
>> The fastest way I know of is to create a list of cumulative
>> frequencies, then generate uniformly distributed numbers and
>> use a binary search to find where they fall in the list.
>> That's O(log n) per sample in the size of the list once it's
>> been set up.
>>
> That's the sort of thing I've been thinking about. But once I'd found
> the relevant category I'd need to reduce its frequency by 1 and
> correspondingly update the cumulative frequencies. Alternatively, I
> could add an extra step where I selected a unit from the relevant
> category with probability equal to the proportion of non-sampled units
> from the category. I could maybe set up an alias table and do something
> similar.
>
> The other thing I was thinking about was iterating through the
> categories (ideally from largest frequency to smallest frequency),
> generating the numbers to be sampled from the current category and the
> remaining categories (using numpy.random.hypergeometric). With a few
> large frequencies and lots of small frequencies that could be quite
> quick (on average). Alternatively I could partition the categories into
> two sets, generate the number to be sampled from each partition, then
> partition the partitions etc. binary search style.
>
> I suppose I'll try the both the alias table + rejection step and the
> recursive partitioning approach and see how they turn out. Cheers.

       R has functions "sample" and "sample.int";  see 
"https://www.rdocumentation.org/packages/base/versions/3.5.2/topics/sample". 
You can call R from Python, 
"https://sites.google.com/site/aslugsguidetopython/data-analysis/pandas/calling-r-from-python". 

       These are in the "base" package.  I believe they have been an 
important part of the base R language almost since its inception and 
have been used extensively.  You'd have to work really hard to do 
better, in my judgment.

       Spencer Graves

DISCLAIMER:  I'm primarily an R guy and only use Python when I can't 
find a sensible way to do what I want in R.
>
> Duncan