
I ported an algorithm originally written in MatLab to Python for generating a random contingency table (when one exists) for a statistical test I am writing. I know something like this doesn't exist in scipy.stats.contingency and was wondering if it would be a good addition? I haven't developed any tests for the function yet but can if needed. I have just started developing in Python as an undergraduate student studying computational biology. This would be my first time contributing to open source and I am looking forward to the the experience and all the advice you are willing to offer. Thanks for your time. -Clint def rcont(ncol, nrow, ncolt, nrowt): """ Return a random contingency table based on the two-dimensional shape and marginal totals of a given contingency table There must be 1 or many tables with nonnegative, integral entries that have the given shape and marginal sums as input. The function will generate, at random, one of the tables and return it. Repeated calls to the function will return new random selections. Parameters: ----------- ncol : integer The vector length of the columns in the contingency table nrow : integer The vector length of the rows in the contingency table ncolt : array_like The vector of the marginal column totals in the contingency table nrowt : array_like The vector of the marginal row totals in the contingency table Returns ------- imatrix : ndarray of integers The randomly generated two-dimensional ndarray preserving the same shape and marginal totals provided if exists Reference --------- Algorithm AS144: Appl. Statist. (1979) v.28, pp.329-332 Licensing: GNU LGPL license Author: Original FORTRAN77 version by James Boyett. MATLAB version by John Burkardt. Python version by Clint Valentine Reference: James Boyett, Algorithm AS 144: Random R x C Tables with Given Row and Column Totals, Applied Statistics, Volume 28, Number 3, pages 329-332, 1979. """ # Initialize and permute vector nvect = np.array(xrange(sum(ncolt)), dtype=np.int_) np.random.shuffle(nvect) # Construct random matrix and partial column sums imatrix = np.zeros([nrow, ncol], dtype=np.int_) nsubt = np.cumsum(ncolt) # Generate random RxC contingency tables preserving shape and # marginal column and row sums and total sum. count = 0 for i in xrange(nrow): limit = nrowt[i] for _ in xrange(limit): for j in xrange(ncol): if nvect[count] <= nsubt[j]: count += 1 imatrix[i][j] += 1 break return imatrix

On Mon, Dec 22, 2014 at 7:47 PM, Clint Valentine <valentine.c@husky.neu.edu> wrote:
I ported an algorithm originally written in MatLab to Python for generating a random contingency table (when one exists) for a statistical test I am writing. I know something like this doesn't exist in scipy.stats.contingency and was wondering if it would be a good addition? I haven't developed any tests for the function yet but can if needed.
I have just started developing in Python as an undergraduate student studying computational biology. This would be my first time contributing to open source and I am looking forward to the the experience and all the advice you are willing to offer. Thanks for your time.
Hi Clint, More tools for the analysis of contigency tables, or other missing pieces in stats, will be useful and will find a home in either scipy.stats or statsmodels. Specifically to this: I never looked at random generation of contingency tables, but I think it would be useful for bootstrap or Monte Carlo based hypothesis tests, which are currently not yet available in scipy.stats but some are in other pacakges. However, scipy, and statsmodels and most other packages in this neighborhood, are BSD licensed and GPL is not compatible with it. So, we need to either translate code that is license compatible (like most of the code on the matlab fileexchange), or write the code based on the description of the algorithm without copying/translating code that has an incompatible license.. Josef
-Clint
def rcont(ncol, nrow, ncolt, nrowt): """ Return a random contingency table based on the two-dimensional shape and marginal totals of a given contingency table
There must be 1 or many tables with nonnegative, integral entries that have the given shape and marginal sums as input. The function will generate, at random, one of the tables and return it. Repeated calls to the function will return new random selections.
Parameters: ----------- ncol : integer The vector length of the columns in the contingency table
nrow : integer The vector length of the rows in the contingency table
ncolt : array_like The vector of the marginal column totals in the contingency table
nrowt : array_like The vector of the marginal row totals in the contingency table
Returns ------- imatrix : ndarray of integers The randomly generated two-dimensional ndarray preserving the same shape and marginal totals provided if exists
Reference --------- Algorithm AS144: Appl. Statist. (1979) v.28, pp.329-332
Licensing: GNU LGPL license
Author: Original FORTRAN77 version by James Boyett. MATLAB version by John Burkardt. Python version by Clint Valentine
Reference: James Boyett, Algorithm AS 144: Random R x C Tables with Given Row and Column Totals, Applied Statistics, Volume 28, Number 3, pages 329-332, 1979. """ # Initialize and permute vector nvect = np.array(xrange(sum(ncolt)), dtype=np.int_) np.random.shuffle(nvect)
# Construct random matrix and partial column sums imatrix = np.zeros([nrow, ncol], dtype=np.int_) nsubt = np.cumsum(ncolt)
# Generate random RxC contingency tables preserving shape and # marginal column and row sums and total sum. count = 0 for i in xrange(nrow): limit = nrowt[i] for _ in xrange(limit): for j in xrange(ncol): if nvect[count] <= nsubt[j]: count += 1 imatrix[i][j] += 1 break return imatrix
_______________________________________________ SciPy-Dev mailing list SciPy-Dev@scipy.org http://mail.scipy.org/mailman/listinfo/scipy-dev

As Joseph said, we cannot have LGPL code in SciPy, but we can use the original paper as source. The license must be SciPy's version of the BSD license, or at least compatible. Some comments on your code: The inner loop should be vectorized. If not possible (I have not looked at it carefully) you should consider to use Cython or Fortran (f2py). imatrix[i][j] is a common beginner's mistake. Use imatrix[i,j] (yes it matters) or preferably vectorize (cf. comment above). The function should take a RandomState as an optional argument. For Monte Carlo, it would be nice to have a repeats argument as well, to squash some of the Python overhead. If you add a repeats argument it is better to use Cython or Fortran because numpy.random.shuffle and numpy.random.permutation do not take an axis argument, hence you cannot vectorize the suffling. (This is a problem in NumPy we should fix.) All in all, I would say this points to Cython based solution. Sturla On 23/12/14 01:47, Clint Valentine wrote:
I ported an algorithm originally written in MatLab to Python for generating a random contingency table (when one exists) for a statistical test I am writing. I know something like this doesn't exist in scipy.stats.contingency and was wondering if it would be a good addition? I haven't developed any tests for the function yet but can if needed.
I have just started developing in Python as an undergraduate student studying computational biology. This would be my first time contributing to open source and I am looking forward to the the experience and all the advice you are willing to offer. Thanks for your time.
-Clint
def rcont(ncol, nrow, ncolt, nrowt): """ Return a random contingency table based on the two-dimensional shape and marginal totals of a given contingency table
There must be 1 or many tables with nonnegative, integral entries that have the given shape and marginal sums as input. The function will generate, at random, one of the tables and return it. Repeated calls to the function will return new random selections.
Parameters: ----------- ncol : integer The vector length of the columns in the contingency table
nrow : integer The vector length of the rows in the contingency table
ncolt : array_like The vector of the marginal column totals in the contingency table
nrowt : array_like The vector of the marginal row totals in the contingency table
Returns ------- imatrix : ndarray of integers The randomly generated two-dimensional ndarray preserving the same shape and marginal totals provided if exists
Reference --------- Algorithm AS144: Appl. Statist. (1979) v.28, pp.329-332
Licensing: GNU LGPL license
Author: Original FORTRAN77 version by James Boyett. MATLAB version by John Burkardt. Python version by Clint Valentine
Reference: James Boyett, Algorithm AS 144: Random R x C Tables with Given Row and Column Totals, Applied Statistics, Volume 28, Number 3, pages 329-332, 1979. """ # Initialize and permute vector nvect = np.array(xrange(sum(ncolt)), dtype=np.int_) np.random.shuffle(nvect)
# Construct random matrix and partial column sums imatrix = np.zeros([nrow, ncol], dtype=np.int_) nsubt = np.cumsum(ncolt)
# Generate random RxC contingency tables preserving shape and # marginal column and row sums and total sum. count = 0 for i in xrange(nrow): limit = nrowt[i] for _ in xrange(limit): for j in xrange(ncol): if nvect[count] <= nsubt[j]: count += 1 imatrix[i][j] += 1 break return imatrix
_______________________________________________ SciPy-Dev mailing list SciPy-Dev@scipy.org http://mail.scipy.org/mailman/listinfo/scipy-dev
participants (3)
-
Clint Valentine
-
josef.pktd@gmail.com
-
Sturla Molden