Alternative to R phyper

Hello everyone, I have a bit of code where I am using rpy2 to import R phyper so I can perform an hypergeometric test. Unfortunately our cluster does not have a functional installation of rpy2 working. So I am wondering if I could translate to scipy which would make the code completly independent of R. The python code I have is as following: def lphyper(self,q,m,n,k): """ self.phyper(self,q,m,n,k) Calculate p-value using R function phyper from rpy2 low-level interface. "R Documentation phyper(q, m, n, k, lower.tail = TRUE, log.p = FALSE) q: vector of quantiles representing the number of white balls drawn without replacement from an urn which contains both black and white balls. m: the number of white balls in the urn. n: the number of black balls in the urn. k: the number of balls drawn from the urn. log.p: logical; if TRUE, probabilities p are given as log(p). lower.tail: logical; if TRUE (default), probabilities are P[X <= x], otherwise, P[X > x]. """ phyper_q = SexpVector([q,], rinterface.INTSXP) phyper_m = SexpVector([m,], rinterface.INTSXP) phyper_n = SexpVector([n,], rinterface.INTSXP) phyper_k = SexpVector([k,], rinterface.INTSXP) return phyper(phyper_q,phyper_m,phyper_n,phyper_k,**myparams)[0] I have looked to scipy.stats.hypergeom but it is giving me a different result which is also negative.
1-phyper(45, 92, 7518, 1329) [1] 6.92113e-13
In [24]: stats.hypergeom.sf(45,(92+7518),92,1329) Out[24]: -8.4343643180773142e-12 This was supposed to be an error with an older version of scipy but I am using more recent versions of it which should not contain the error anymore: In [26]: numpy.__version__ Out[26]: '1.5.1' In [27]: scipy.__version__ Out[27]: '0.9.0' thank you very much in advance for any help. Best, Bruno

On Mon, Apr 30, 2012 at 12:09, Bruno Santos <bacmsantos@gmail.com> wrote:
Hello everyone,
I have a bit of code where I am using rpy2 to import R phyper so I can perform an hypergeometric test. Unfortunately our cluster does not have a functional installation of rpy2 working. So I am wondering if I could translate to scipy which would make the code completly independent of R. The python code I have is as following:
def lphyper(self,q,m,n,k): """ self.phyper(self,q,m,n,k) Calculate p-value using R function phyper from rpy2 low-level interface. "R Documentation phyper(q, m, n, k, lower.tail = TRUE, log.p = FALSE) q: vector of quantiles representing the number of white balls drawn without replacement from an urn which contains both black and white balls. m: the number of white balls in the urn. n: the number of black balls in the urn. k: the number of balls drawn from the urn. log.p: logical; if TRUE, probabilities p are given as log(p). lower.tail: logical; if TRUE (default), probabilities are P[X <= x], otherwise, P[X > x]. """ phyper_q = SexpVector([q,], rinterface.INTSXP) phyper_m = SexpVector([m,], rinterface.INTSXP) phyper_n = SexpVector([n,], rinterface.INTSXP) phyper_k = SexpVector([k,], rinterface.INTSXP) return phyper(phyper_q,phyper_m,phyper_n,phyper_k,**myparams)[0]
I have looked to scipy.stats.hypergeom but it is giving me a different result which is also negative.
1-phyper(45, 92, 7518, 1329) [1] 6.92113e-13
In [24]: stats.hypergeom.sf(45,(92+7518),92,1329) Out[24]: -8.4343643180773142e-12
The CDF (CMF? whatever) for stats.hypergeom is not implemented explicitly, so it falls back to the default implementation of just summing up the PMF. You are falling victim to floating point error accumulation such that the sum exceeds 1.0, so the survival function 1-CDF ends up negative. I don't think we have a better implementation of the CDF anywhere in scipy, sorry. -- Robert Kern

On Mon, Apr 30, 2012 at 7:27 AM, Robert Kern <robert.kern@gmail.com> wrote:
On Mon, Apr 30, 2012 at 12:09, Bruno Santos <bacmsantos@gmail.com> wrote:
Hello everyone,
I have a bit of code where I am using rpy2 to import R phyper so I can perform an hypergeometric test. Unfortunately our cluster does not have a functional installation of rpy2 working. So I am wondering if I could translate to scipy which would make the code completly independent of R. The python code I have is as following:
def lphyper(self,q,m,n,k): """ self.phyper(self,q,m,n,k) Calculate p-value using R function phyper from rpy2 low-level interface. "R Documentation phyper(q, m, n, k, lower.tail = TRUE, log.p = FALSE) q: vector of quantiles representing the number of white balls drawn without replacement from an urn which contains both black and white balls. m: the number of white balls in the urn. n: the number of black balls in the urn. k: the number of balls drawn from the urn. log.p: logical; if TRUE, probabilities p are given as log(p). lower.tail: logical; if TRUE (default), probabilities are P[X <= x], otherwise, P[X > x]. """ phyper_q = SexpVector([q,], rinterface.INTSXP) phyper_m = SexpVector([m,], rinterface.INTSXP) phyper_n = SexpVector([n,], rinterface.INTSXP) phyper_k = SexpVector([k,], rinterface.INTSXP) return phyper(phyper_q,phyper_m,phyper_n,phyper_k,**myparams)[0]
I have looked to scipy.stats.hypergeom but it is giving me a different result which is also negative.
1-phyper(45, 92, 7518, 1329) [1] 6.92113e-13
In [24]: stats.hypergeom.sf(45,(92+7518),92,1329) Out[24]: -8.4343643180773142e-12
the corrected version with explicit sf was added in 0.10
from scipy import stats stats.hypergeom.sf(45,(92+7518),92,1329) 6.9212632647221852e-13
import scipy scipy.__version__ '0.10.0b2'
Josef
The CDF (CMF? whatever) for stats.hypergeom is not implemented explicitly, so it falls back to the default implementation of just summing up the PMF. You are falling victim to floating point error accumulation such that the sum exceeds 1.0, so the survival function 1-CDF ends up negative. I don't think we have a better implementation of the CDF anywhere in scipy, sorry.
-- Robert Kern _______________________________________________ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
participants (3)
-
Bruno Santos
-
josef.pktd@gmail.com
-
Robert Kern