problems with stats
Howdy, I've encountered two problems with the binomial distribution in the stats module. The first is trivial and I believe just an argument checking problem.
binom.pdf(0,3,.5)
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
  File ".\site-packages\scipy\stats\distributions.py", line 2371, in
bdtr domain error 1.#QNAN

(the answer should be .125 of course)

The second is a little more significant. I'm trying to invert the binomial distribution using ppf. I looked at distributions.py:

class binom_gen(rv_discrete):
    ...
    def _ppf(self, q, n, pr):
        vals = ceil(special.bdtrik(q, n, pr))
        vals1 = vals - 1
        temp = special.bdtr(vals1, n, pr)
        return where(temp >= q, vals1, vals)

and ppf is in fact implemented, but when I try to run it I get an error:

binom.ppf(3,4,.95)
TypeError: _ppf() takes exactly 4 arguments (3 given)

Does anyone know what's going on? Am I misusing something?

thanks,
Danny
danny shevitz wrote:
Howdy,
I've encountered two problems with the binomial distribution in the stats module. The first is trivial and I believe just an argument checking problem.
binom.pdf(0,3,.5)
bdtr domain error 1.#QNAN
Which version are you using? The current version, 0.3, uses pmf (probability mass function) for discrete rvs. For me:

binom.pmf(0,3,.5)
0.125
(the answer should be .125 of course)
The second is a little more significant. I'm trying to invert the binomial distribution using ppf. I looked at distributions.py
class binom_gen(rv_discrete):
    ...
    def _ppf(self, q, n, pr):
        vals = ceil(special.bdtrik(q, n, pr))
        vals1 = vals - 1
        temp = special.bdtr(vals1, n, pr)
        return where(temp >= q, vals1, vals)
and ppf is in fact implemented, but when I try to run it I get an error.
binom.ppf(3,4,.95)
You uncovered a genuine bug here. But, by the way, what do you expect this to return, seeing how 3 is bigger than 1?

-Travis
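[For context: ppf takes a cumulative probability q in [0, 1] as its first argument, so the 3 in binom.ppf(3,4,.95) is out of range. A minimal sketch of valid calls, assuming the argument order q, n, p of a current scipy:]

```python
from scipy.stats import binom

# ppf(q, n, p): smallest k such that P(X <= k) >= q for X ~ Binomial(n, p)
n, p = 4, 0.5

# P(X <= 3) = 0.9375 < 0.95, so the 95% point is k = 4
print(binom.ppf(0.95, n, p))  # 4.0

# P(X <= 3) = 0.9375 >= 0.90, so the 90% point is k = 3
print(binom.ppf(0.90, n, p))  # 3.0
```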
I'm using 0.2. I'm trying to get a release out the door and haven't had the chance to download 0.3 yet (it's only been out a week, so I beg for an indulgence).

As for binom.ppf(), here's what I was trying to do. It's a fairly standard problem. I am running a Monte Carlo simulation to try to determine the probability of an event happening and put confidence bounds on the answer. The probability is relatively low, e.g. 10e-6. I'm running approximately 10^7 trials and getting approximately 10 events. There is a standard way to assign the error bars using the continuous F distribution, but when I looked at the stats documentation I noticed the binomial distribution could be inverted via ppf. That's not to say I was using it correctly, which you graciously pointed out I was not. Of course there is no documentation either, so I had to guess from the code :-). In short, I was just trying to play with the binomial distribution to see if I could get it to do what I wanted directly.

The short version of the problem that I was trying to solve is: given N trials and n events, what is the specified confidence interval for the binomial probability p? The hypothetical call should therefore take two integers and a probability, e.g.

    f(10, 10000000, .95)

The third number is a confidence, not the probability in the binomial distribution. Of course there would have to be a pair of such functions (one for the lower bound, and one for the upper).

Like I said, what I was doing wasn't particularly well thought out. I was just playing around, and using the function wrong anyway...

BTW, I guess I have no idea what binom.ppf is supposed to do then. Why do you need two probabilities? Is one a confidence and the other the probability? Also, in case it wasn't obvious, I'm not a statistician by training :-)

D
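[The standard frequentist answer to this "N trials, n events" problem is the Clopper-Pearson exact interval, which can be written with beta quantiles instead of the F-statistic transformations mentioned above. A sketch, assuming scipy.stats.beta and using the numbers from the example; the function name binom_conf_interval is hypothetical:]

```python
from scipy.stats import beta

def binom_conf_interval(k, N, conf=0.90):
    """Two-sided Clopper-Pearson confidence interval for the binomial
    probability p, given k successes in N trials."""
    alpha = 1.0 - conf
    # The edge cases k = 0 and k = N pin the corresponding bound at 0 or 1.
    lower = beta.ppf(alpha / 2, k, N - k + 1) if k > 0 else 0.0
    upper = beta.ppf(1 - alpha / 2, k + 1, N - k) if k < N else 1.0
    return lower, upper

# 10 events in 10^7 trials, 90% two-sided interval
lo, hi = binom_conf_interval(10, 10**7, conf=0.90)
print(lo, hi)  # brackets the point estimate 1e-6
```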
The second is a little more significant. I'm trying to invert the binomial distribution using ppf. I looked at distributions.py
class binom_gen(rv_discrete):
    ...
    def _ppf(self, q, n, pr):
        vals = ceil(special.bdtrik(q, n, pr))
        vals1 = vals - 1
        temp = special.bdtr(vals1, n, pr)
        return where(temp >= q, vals1, vals)
and ppf is in fact implemented, but when I try to run it I get an error.
binom.ppf(3,4,.95)
You uncovered a genuine bug here.
But, by the way, what do you expect this to return, seeing how 3 is bigger than 1?
-Travis
_______________________________________________ SciPy-user mailing list SciPy-user@scipy.net http://www.scipy.net/mailman/listinfo/scipy-user
danny shevitz wrote:
The short version of the problem that I was trying to solve is: Given N trials and n events, what is the specified confidence interval for the binomial probability p.
The hypothetical argument call should therefore take 2 integers and a probability, e.g. f(10, 10000000, .95) The third number is a confidence, not the probability in the binomial distribution. Of course there would have to be a pair of such functions ( one for the lower bound, and one for the upper).
What you are looking for is the beta distribution. Here's the Bayesian analysis:

We have n Bernoulli trials and r successes with an unknown chance (theta) of success on any given trial. The likelihood is

    P(n,r|theta) = theta**r * (1-theta)**(n-r)

The posterior probability is

    P(theta|n,r) = A * P(n,r|theta) * P(theta)

A is a normalizing constant and P(theta) is your prior probability for theta. For now, let's use a minimally informative prior in order to make the fewest assumptions:

    P(theta) o< 1/(theta*(1-theta))

The o< symbol is the proportionality symbol. This prior is improper and cannot be normalized as it stands. When we put it together with the likelihood, as long as r does not equal either 0 or n, the posterior will come out normalized. There are good reasons, which I won't go into now, to use this instead of P(theta)=1, but you can use that instead and redo the math that leads to the following (easy!). So,

    P(theta|n,r) = A * theta**(r-1) * (1-theta)**(n-r-1)

We can recognize this as the Beta distribution, Beta(r, n-r) (or Beta(r+1, n-r+1) if you're queasy about improper priors).

special.bdtri(10, 10000000, 0.95) will give you the one-sided 95% credible interval (like a confidence interval, only Bayesian). Since this distribution isn't symmetric, I'm not entirely sure how to get the highest posterior density (HPD) credible interval around the peak. And I'm not entirely sure how accurate the function is with those extreme inputs.

This analysis is lifted straight from Chapter 6 of the excellent book _Probability Theory: The Logic of Science_ by E. T. Jaynes (Cambridge Press, 2003).
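[The Beta(r, n-r) posterior above can be evaluated directly with scipy.stats.beta; this is a sketch, using the 10-in-10^7 numbers from the thread and illustrative 5%/95% tail levels (equal-tailed, not the HPD interval):]

```python
from scipy.stats import beta

r, n = 10, 10**7             # successes, trials
# Posterior from the improper 1/(theta*(1-theta)) prior: Beta(r, n-r)
posterior = beta(r, n - r)

# Central 90% credible interval (equal tail areas on each side)
lower = posterior.ppf(0.05)
upper = posterior.ppf(0.95)
print(lower, upper)  # brackets the posterior mean r/n = 1e-6
```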
Like I said, what I was doing wasn't particularly well thought out. I was just playing around, and using the function wrong anyway...

BTW I guess I have no idea what binom.ppf is supposed to do then. Why do you need two probabilities? Is one a confidence and the other the probability?
The random variable in the binomial distribution is the number of successes given the probability p of individual success and the number of trials. The point mass function would invert for n, the number of successes that, by summing the probability masses for each number of successes between 0 and n, gives you 0.95 (for example). So yes, one input is the confidence level, one is the probability of individual success, and you also need the number of trials.
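[In other words, ppf is the inverse of the cumulative distribution function. A sketch of that round trip, with illustrative numbers:]

```python
from scipy.stats import binom

q, n, p = 0.95, 100, 0.3   # confidence level, number of trials, per-trial success probability
k = binom.ppf(q, n, p)     # smallest k with P(X <= k) >= q

# The defining property of the percent point function for a discrete rv:
assert binom.cdf(k, n, p) >= q
assert binom.cdf(k - 1, n, p) < q
print(k)
```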
Also in case it wasn't obvious, I'm not a statistician by training :-)
D
-- Robert Kern rkern@ucsd.edu "In the fields of hell where the grass grows high Are the graves of dreams allowed to die." -- Richard Harter
Robert Kern wrote:
The random variable in the binomial distribution is the number of
successes given the probability p of individual success and the number
of trials. The point mass function would invert for n, the number of
               ^^^^^^^^^^^^^^^^^^^
That's "percent point function", you silly idiot!
-- Robert Kern rkern@ucsd.edu "In the fields of hell where the grass grows high Are the graves of dreams allowed to die." -- Richard Harter
Robert, thanks for the response. I haven't forgotten what you wrote, I just haven't had a chance to really dig into it. Now I have.

So: I have been using the frequentist approach to constructing the error bars, and I understand that. It involves taking some linear fractional transformations of F statistics. After reading your letter, I was able to compute the same error bounds directly. The key was the bdtri function, as you mentioned:

    pLower = bdtri(n-1, N, .95)
    pUpper = bdtri(n, N, .05)

for example gives the .90 confidence interval. This is the pair of functions I needed. The n-1 in pLower is because the value n is considered in the tail for the purposes of computing the confidences. I didn't have a prayer in scipy.stats because bdtri is not wrapped. It's totally different from ppf, which still requires the unknown probability in order to construct the distribution at all.

So I can get the frequentist result directly; now I'm curious about your Bayesian analysis. One of the uniform or improper priors gives the same answer. BTW, I'm also curious whether you left off the combinatoric factor (n choose r) in your description. Can you give me a little of the rationale for choosing the improper prior?

thanks,
Danny

--- Robert Kern <rkern@ucsd.edu> wrote:
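[A sketch checking this pair of bounds numerically. The names n_events/N are mine; bdtri(k, N, y) solves bdtr(k, N, p) = y for p, and the final asserts check the equivalent beta-quantile (Clopper-Pearson) form of the same bounds, an identity I'm stating from the standard binomial/incomplete-beta relation:]

```python
from math import isclose
from scipy.special import bdtr, bdtri
from scipy.stats import beta

n_events, N = 10, 10**7

pLower = bdtri(n_events - 1, N, 0.95)  # lower bound of the 90% interval
pUpper = bdtri(n_events, N, 0.05)      # upper bound of the 90% interval

# bdtri is the inverse of bdtr with respect to p:
assert isclose(bdtr(n_events - 1, N, pLower), 0.95, rel_tol=1e-6)
assert isclose(bdtr(n_events, N, pUpper), 0.05, rel_tol=1e-6)

# Equivalent beta-quantile form of the same bounds:
assert isclose(pLower, beta.ppf(0.05, n_events, N - n_events + 1), rel_tol=1e-6)
assert isclose(pUpper, beta.ppf(0.95, n_events + 1, N - n_events), rel_tol=1e-6)

print(pLower, pUpper)  # the 90% interval around the point estimate 1e-6
```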
danny shevitz wrote:
The short version of the problem that I was trying to solve is: Given N trials and n events, what is the specified confidence interval for the binomial probability p.
The hypothetical argument call should therefore take 2 integers and a probability, e.g. f(10, 10000000, .95) The third number is a confidence, not the probability in the binomial distribution. Of course there would have to be a pair of such functions ( one for the lower bound, and one for the upper).
What you are looking for is the beta distribution. Here's the Bayesian analysis:
We have n Bernoulli trials and r successes with an unknown chance (theta) of success on any given trial. The likelihood is
P(n,r|theta) = theta**r * (1-theta)**(n-r)
The posterior probability is
P(theta|n,r) = A * P(n,r|theta) * P(theta)
A is a normalizing constant and P(theta) is your prior probability for theta. For now, let's use a minimally informative prior in order to make the fewest assumptions.
P(theta) o< 1/(theta*(1-theta))
The o< symbol is the proportionality symbol. This prior is improper and cannot be normalized as it stands. When we put it together with the likelihood, as long as r does not equal either 0 or n, the posterior will come out normalized. There are good reasons which I won't go into now to use this instead of P(theta)=1, but you can use that instead and redo the math that leads to the following (easy!).
So, P(theta|n,r) = A * theta**(r-1) * (1-theta)**(n-r-1)
We can recognize this as the Beta distribution, Beta(r,n-r) (or Beta(r+1,n-r+1) if you're queasy about improper priors).
special.bdtri(10, 10000000, 0.95) will give you the one-sided 95% credible interval (like a confidence interval, only Bayesian). Since this distribution isn't symmetric, I'm not entirely sure how to get the highest posterior density (HPD) credible interval around the peak. And I'm not entirely sure how accurate the function is with those extreme
inputs.
This analysis is lifted straight from Chapter 6 of the excellent book
_Probability Theory: The Logic of Science_ by E. T. Jaynes (Cambridge
Press, 2003).
Like I said, what I was doing wasn't particularly well thought out. I was just playing around, and using the function wrong anyway...

BTW I guess I have no idea what binom.ppf is supposed to do then. Why do you need two probabilities? Is one a confidence and the other the probability?
The random variable in the binomial distribution is the number of successes given the probability p of individual success and the number of trials. The point mass function would invert for n, the number of successes that, by summing the probability masses for each number of successes between 0 and n, gives you 0.95 (for example). So yes, one input is the confidence level, one is the probability of individual success, and you also need the number of trials.
Also in case it wasn't obvious, I'm not a statistician by training :-)
D
-- Robert Kern rkern@ucsd.edu
"In the fields of hell where the grass grows high Are the graves of dreams allowed to die." -- Richard Harter
Sorry for the post, I meant to send it to Robert, not the group.

Danny
participants (3)

- danny shevitz
- Robert Kern
- Travis E. Oliphant