Making faster statistical distributions
I am already using pieces of SciPy in my Markov chain Monte Carlo package (PyMC), mostly for plotting functionality. I would also like to exploit the distributions implemented in scipy.stats, but they are far too slow for use in statistical simulation applications like MCMC, where millions of random draws may be taken. Therefore, I am thinking of implementing many of these distributions (at least the common ones) as C or Fortran extensions. I am unsure whether to use Fortran through f2py for this task, or C through weave.inline (for example). I have used both in the past for various tasks, and was generally happy with both. Any suggestions? Thanks, C. -- Christopher J. Fonnesbeck ( c h r i s @ f o n n e s b e c k . o r g ) Georgia Cooperative Fish & Wildlife Research Unit, University of Georgia
"Christopher" == Christopher Fonnesbeck <chris@fonnesbeck.org> writes:
Christopher> I am already using pieces of SciPy in my Markov chain Christopher> Monte Carlo package (PyMC), mostly for plotting Christopher> functionality. I would also like to exploit the Christopher> distributions implemented in scipy.stats, but they Christopher> are far too slow for use in statistical simulation Christopher> applications like MCMC, where millions of random Christopher> draws may be taken. Therefore, I am thinking of Christopher> implementing many of these distributions (at least Christopher> the common ones) as C or Fortran extensions. I am Christopher> unsure whether to use Fortran through f2py for this Christopher> task, or C through weave.inline (for example). I have Christopher> used both in the past for various tasks, and was Christopher> generally happy with both. Any suggestions? I've wrote some code to fill Numeric arrays with levy distributions from gsl. GSL has many, many distributions - http://www.gnu.org/software/gsl/manual/gsl-ref_19.html#SEC284 Here's a sample test script from levy import randlevy, seed gamma = 0.003501 alpha = 1.341702 seed(1235) x = randlevy(100000, alpha, gamma) print min(x), max(x) Seeding is optional - the rng is seeded by the clock if you don't explicitly call seed. Below is the module code which you can use as a template if you're interested. I think it would be nice to extend this code to include all the gsl distributions, but haven't taken the time. pygsl wraps these distributions, but suffers from the repeated call overhead that is hurting you. It would be a natural extension of pygsl to include initializing numeric arrays from the distributions. #include "Python.h" #include "Numeric/arrayobject.h" #include <stdio.h> #include <time.h> #include <gsl/gsl_rng.h> #include <gsl/gsl_randist.h> #include <gsl/gsl_errno.h> static PyObject *ErrorObject; // get x[i] from a 1d array of type xtype #define get1d(x,i,xtype) \ *(xtype *)(x->data+i*x->strides[0]) int _levymodRNGSeeded = 0; gsl_rng * _levymodRNG; static PyObject * seed(PyObject *self, PyObject *args) { const gsl_rng_type * T; int N; T = gsl_rng_default; _levymodRNG = gsl_rng_alloc(T); gsl_rng_env_setup(); if (args==Py_None) { //printf("Default seed\n"); time_t t1; (void) time(&t1); srand48((long) t1); N = (int)t1; } else if (!PyArg_ParseTuple(args, "i", &N)) return NULL; //printf("Setting seed %d\n", N); gsl_rng_set (_levymodRNG, N); _levymodRNGSeeded = 1; Py_INCREF(Py_None); return Py_None; } static PyObject * randlevy(PyObject *self, PyObject *args) { if (_levymodRNGSeeded == 0) { seed(self, Py_None); } int dimensions[1]; int i; PyArrayObject *x; int N; float alpha; float gamma; if (!PyArg_ParseTuple(args, "iff", &N, &alpha, &gamma)) return NULL; dimensions[0] = N; x = (PyArrayObject *)PyArray_FromDims(1,dimensions,PyArray_FLOAT); for (i=0; i<N; ++i) { get1d(x,i,float) = gsl_ran_levy(_levymodRNG, gamma, alpha); } return Py_BuildValue("N", x); } static struct PyMethodDef _levy_methods[] = { {"randlevy", (PyCFunction)randlevy, METH_VARARGS, "Return a numeric array of levy numbers."}, {"seed", (PyCFunction)seed, METH_VARARGS, "Seed the random number generator."}, {NULL, NULL} /* sentinel */ }; DL_EXPORT(void) init_levy(void) { PyObject *m, *d; /* Create the module and add the functions */ m = Py_InitModule4("_levy", _levy_methods, "Fill Numeric arrays with random numbers ", (PyObject*)NULL,PYTHON_API_VERSION); /* Import the array object */ import_array(); /* Add some symbolic constants to the module */ d = PyModule_GetDict(m); ErrorObject = PyString_FromString("_levy.error"); PyDict_SetItemString(d, "error", ErrorObject); /* Check for errors */ if (PyErr_Occurred()) Py_FatalError("can't initialize module _levy"); }
Christopher Fonnesbeck wrote:
I am already using pieces of SciPy in my Markov chain Monte Carlo package (PyMC), mostly for plotting functionality. I would also like to exploit the distributions implemented in scipy.stats, but they are far too slow for use in statistical simulation applications like MCMC, where millions of random draws may be taken. Therefore, I am thinking of implementing many of these distributions (at least the common ones) as C or Fortran extensions. I am unsure whether to use Fortran through f2py for this task, or C through weave.inline (for example). I have used both in the past for various tasks, and was generally happy with both. Any suggestions?
Could you specify which ones are too slow? This is a rather broad statement as many are implemented in C and are very fast. Some distributions, however, do default to using a numerical solver to invert the cdf and apply this to uniform random variates. You can improve the speed of these distributions by overriding the _ppf method or the _rvs method of the object to use a faster, more specialized method. I would use weave or fortran with f2py to do this. Best, -Travis O.
Thanks, C. -- Christopher J. Fonnesbeck ( c h r i s @ f o n n e s b e c k . o r g ) Georgia Cooperative Fish & Wildlife Research Unit, University of Georgia
_______________________________________________ SciPy-user mailing list SciPy-user@scipy.net http://www.scipy.net/mailman/listinfo/scipy-user
On Jan 29, 2004, at 2:17 PM, Travis Oliphant wrote:
Christopher Fonnesbeck wrote:
I am already using pieces of SciPy in my Markov chain Monte Carlo package (PyMC), mostly for plotting functionality. I would also like to exploit the distributions implemented in scipy.stats, but they are far too slow for use in statistical simulation applications like MCMC, where millions of random draws may be taken. Therefore, I am thinking of implementing many of these distributions (at least the common ones) as C or Fortran extensions. I am unsure whether to use Fortran through f2py for this task, or C through weave.inline (for example). I have used both in the past for various tasks, and was generally happy with both. Any suggestions?
Could you specify which ones are too slow? This is a rather broad statement as many are implemented in C and are very fast. Some distributions, however, do default to using a numerical solver to invert the cdf and apply this to uniform random variates. You can improve the speed of these distributions by overriding the _ppf method or the _rvs method of the object to use a faster, more specialized method. I would use weave or fortran with f2py to do this.
Best,
-Travis O.
Well, the binomial and normal distributions, for sure, off the top of my head. Using the scipy distributions slows my MCMC code down significantly (they were the bottleneck, according to the profiling module). Using Fortran via f2py sped things up a lot. I'm not talking about the generation of random deviates, necessarily, but rather the pdf's, which are used for calculating likelihoods. C. -- Christopher J. Fonnesbeck ( c h r i s @ f o n n e s b e c k . o r g ) Georgia Cooperative Fish & Wildlife Research Unit, University of Georgia
On Thu, Jan 29, 2004 at 03:53:10PM -0500, Christopher Fonnesbeck wrote: [snip]
Well, the binomial and normal distributions, for sure, off the top of my head. Using the scipy distributions slows my MCMC code down significantly (they were the bottleneck, according to the profiling module). Using Fortran via f2py sped things up a lot. I'm not talking about the generation of random deviates, necessarily, but rather the pdf's, which are used for calculating likelihoods.
I'm willing to bet that the speed is lost by the generality of the framework. There's a lot of argument manipulation and not a lot of actual computation. It might be useful to add a fastdists module that computes pdfs and cdfs in a more specific (but fast) way for the more common distributions. And log-pdfs, too, please! In fact, log-pdfs first since in many instances like MCMC, they're much more suitable. One could then override pdf() and cdf() (not _pdf() and _cdf()) in the specific classes. But that's a design that appears to have been decided against.
C.
-- Robert Kern rkern@ucsd.edu "In the fields of hell where the grass grows high Are the graves of dreams allowed to die." -- Richard Harter
Robert Kern wrote:
On Thu, Jan 29, 2004 at 03:53:10PM -0500, Christopher Fonnesbeck wrote:
[snip]
Well, the binomial and normal distributions, for sure, off the top of my head. Using the scipy distributions slows my MCMC code down significantly (they were the bottleneck, according to the profiling module). Using Fortran via f2py sped things up a lot. I'm not talking about the generation of random deviates, necessarily, but rather the pdf's, which are used for calculating likelihoods.
I'm willing to bet that the speed is lost by the generality of the framework. There's a lot of argument manipulation and not a lot of actual computation. It might be useful to add a fastdists module that computes pdfs and cdfs in a more specific (but fast) way for the more common distributions. And log-pdfs, too, please! In fact, log-pdfs first since in many instances like MCMC, they're much more suitable.
One could then override pdf() and cdf() (not _pdf() and _cdf()) in the specific classes. But that's a design that appears to have been decided against.
I wouldn't say anything has been decided against at this point. We could pretty easily over-ride pdf and cdf for whatevern distribution you like. Perhaps it would be a good idea to do so for the most common pdfs. -Travis O.
On Jan 29, 2004, at 8:17 PM, Travis Oliphant wrote:
Robert Kern wrote:
On Thu, Jan 29, 2004 at 03:53:10PM -0500, Christopher Fonnesbeck wrote:
[snip]
Well, the binomial and normal distributions, for sure, off the top of my head. Using the scipy distributions slows my MCMC code down significantly (they were the bottleneck, according to the profiling module). Using Fortran via f2py sped things up a lot. I'm not talking about the generation of random deviates, necessarily, but rather the pdf's, which are used for calculating likelihoods.
I'm willing to bet that the speed is lost by the generality of the framework. There's a lot of argument manipulation and not a lot of actual computation. It might be useful to add a fastdists module that computes pdfs and cdfs in a more specific (but fast) way for the more common distributions. And log-pdfs, too, please! In fact, log-pdfs first since in many instances like MCMC, they're much more suitable.
One could then override pdf() and cdf() (not _pdf() and _cdf()) in the specific classes. But that's a design that appears to have been decided against.
I wouldn't say anything has been decided against at this point. We could pretty easily over-ride pdf and cdf for whatevern distribution you like. Perhaps it would be a good idea to do so for the most common pdfs. -Travis O.
Perhaps I will take a shot at this, then, since I already have several Fortran extensions written for log-likelihoods. C. -- Christopher J. Fonnesbeck ( c h r i s @ f o n n e s b e c k . o r g ) Georgia Cooperative Fish & Wildlife Research Unit, University of Georgia
participants (5)
-
Christopher Fonnesbeck
-
John Hunter
-
Robert Kern
-
Travis Oliphant
-
Travis Oliphant