Hi,
numpy doesn't seem to have a function for sampling from simple categorical distributions. The easiest solution I could come up with was something like
from numpy.random import multinomial multinomial(1, [.5, .3, .2]).nonzero()[0][0]
1
but this is bound to be inefficient as soon as the vector of probabilities gets large, especially if you want to draw multiple samples.
Have I overlooked something or should this be added?
 Hagen
On 20101122, at 2:51 AM, Hagen Fürstenau wrote:
but this is bound to be inefficient as soon as the vector of probabilities gets large, especially if you want to draw multiple samples.
Have I overlooked something or should this be added?
I think you misunderstand the point of multinomial distributions. A sample from a multinomial is simply a sample from n i.i.d. categoricals, reported as the counts for each category in the N observations. It's very easy to recover the 'categorical' samples from a 'multinomial' sample.
import numpy as np a = np.random.multinomial(50, [.3, .3, .4]) b = np.zeros(50, dtype=int) upper = np.cumsum(a); lower = upper  a
for value in range(len(a)): b[lower[value]:upper[value]] = value # mix up the order, inplace, if you care about them not being sorted np.random.shuffle(b)
then b is a sample from the corresponding 'categorical' distribution.
David
but this is bound to be inefficient as soon as the vector of probabilities gets large, especially if you want to draw multiple samples.
Have I overlooked something or should this be added?
I think you misunderstand the point of multinomial distributions.
I'm afraid the multiple samples were an afterthought and a false lead. My main use case would be consecutive samples of _different_ categorical distributions, for which abusing multinomials seems wasteful (as they have to allocate a large vector each time). Similarly for the alternative spelling
import numpy, random a = numpy.array([.5, .3, .2]) (a.cumsum()random.random() >= 0).nonzero()[0][0]
1
ISTM that this elementary functionality deserves an implementation that's as fast as it can be.
 Hagen
ISTM that this elementary functionality deserves an implementation that's as fast as it can be.
To substantiate this, I just wrote a simple implementation of "categorical" in "numpy/random/mtrand.pyx" and it's more than 8x faster than your version for multiple samples of the same distribution and more than 3x faster than using "multinomial(1, ...)" for multiple samples of different distributions (each time tested with 1000 samples drawn from distributions over 1000 categories).
I can provide it as a patch if there's any interest.
 Hagen
On Mon, Nov 22, 2010 at 6:05 AM, Hagen Fürstenau hagen@zhuliguan.net wrote:
ISTM that this elementary functionality deserves an implementation that's as fast as it can be.
To substantiate this, I just wrote a simple implementation of "categorical" in "numpy/random/mtrand.pyx" and it's more than 8x faster than your version for multiple samples of the same distribution and more than 3x faster than using "multinomial(1, ...)" for multiple samples of different distributions (each time tested with 1000 samples drawn from distributions over 1000 categories).
I can provide it as a patch if there's any interest.
Can you compare the speed of your cython solution with the version of Chuck
 For instance, weight 0..3 by 1..4, then
In [14]: w = arange(1,5)
In [15]: p = cumsum(w)/float(w.sum())
In [16]: bincount(p.searchsorted(random(1000000)))/1e6 Out[16]: array([ 0.100336, 0.200382, 0.299132, 0.40015 ])
 from numpy mailing list thread "Weighted random integers", sep. 10 Using searchsorted hat roughly a 10 times speedup compared to my multinomial version
Josef
 Hagen
NumPyDiscussion mailing list NumPyDiscussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpydiscussion
Can you compare the speed of your cython solution with the version of Chuck
For multiple samples of the same distribution, it would do more or less the same as the "searchsorted" method, so I don't expect any improvement (except for being easier to find).
For multiple samples of different distributions, my version is 45x faster than "searchsorted(random())". This is without normalizing the probability vector, which means that you typically don't have to sum up the whole vector (and store all the intermediate sums).
 Hagen
participants (3)

David WardeFarley

Hagen Fürstenau

josef.pktd＠gmail.com