numpy.power -> numpy.random.choice Probabilities don't sum to 1

Hi,

I have a matrix whose entries I must raise to a certain power and then normalize by row. After I do that, when I pass some rows to numpy.random.choice, I get a "ValueError: probabilities do not sum to 1". I understand that floating point is not exact, and my matrix is so large that I cannot use np.longdouble without running out of RAM.

As an example on a smaller matrix:

    np.power(mymatrix, 10, out=mymatrix)
    row_normalized = np.apply_along_axis(lambda x: x / np.sum(x), 1, mymatrix)
    sums = row_normalized.sum(axis=1)
    sums[np.where(sums != 1)]
    array([ 0.99999994,  0.99999994,  1.00000012, ...,  0.99999994,
            0.99999994,  0.99999994], dtype=float32)

    np.random.choice(range(row_normalized.shape[0]), 1, p=row_normalized[0, :])
    ...
    ValueError: probabilities do not sum to 1

I also tried the normalize function in sklearn.preprocessing and had the same problem. Is there a way to avoid this without manually adjusting the rows so that their sums come out to exactly 1?

-- Ryan
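[A minimal self-contained reproduction of the situation described above, on made-up random data rather than the poster's actual matrix. Whether np.random.choice actually raises depends on how far a given row's sum drifts and on the tolerance used by the installed numpy version, so the call is wrapped in try/except rather than assumed to fail:]

```python
import numpy as np

# Stand-in data: a random float32 matrix, raised to the 10th power and
# row-normalized, mirroring the steps in the question.
rng = np.random.RandomState(0)
mymatrix = rng.rand(1024, 1024).astype(np.float32)
np.power(mymatrix, 10, out=mymatrix)

row_normalized = mymatrix / mymatrix.sum(axis=1, keepdims=True)
sums = row_normalized.sum(axis=1)

# In float32, many row sums land a few ULPs away from exactly 1.0.
print("rows whose sum != 1:", np.count_nonzero(sums != 1))

# choice may or may not raise for such a row, depending on the numpy
# version's internal tolerance for sum(p) - 1.
try:
    np.random.choice(row_normalized.shape[0], 1, p=row_normalized[0])
except ValueError as e:
    print("ValueError:", e)
```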

On Fri, Dec 18, 2015 at 1:25 PM, Ryan R. Rosario <ryan@bytemining.com> wrote:
I'm sorry I don't have a solution to your actual problem off the top of my head, but it's probably helpful in general to know that a better way to write this would be just:

    row_normalized = mymatrix / np.sum(mymatrix, axis=1, keepdims=True)

apply_along_axis is slow and can almost always be replaced by a broadcasting expression like this.
    sums = row_normalized.sum(axis=1)
    sums[np.where(sums != 1)]
And here you can just write:

    sums[sums != 1]

i.e. the call to where() isn't doing anything useful.

-n

--
Nathaniel J. Smith -- http://vorpus.org
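[To illustrate the point above, a quick check on a small made-up matrix that the broadcasting form produces the same result as the apply_along_axis version:]

```python
import numpy as np

rng = np.random.RandomState(42)
m = rng.rand(4, 5)

# Original formulation: apply a normalizing lambda to each row.
via_apply = np.apply_along_axis(lambda x: x / np.sum(x), 1, m)

# Broadcasting formulation: keepdims=True keeps the row sums as a
# (4, 1) column, so the division broadcasts across each row.
via_broadcast = m / np.sum(m, axis=1, keepdims=True)

assert np.allclose(via_apply, via_broadcast)
```

The broadcasting version avoids the per-row Python-level function calls that make apply_along_axis slow.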

A simple fix would certainly be to bypass the check in random.choice, but I don't know how to do that. So let's focus on the summation. I believe you are hitting an instability in summing the small numbers that raising to the 10th power produces. Here is an example:

    mymatrix = np.random.rand(1024, 1024).astype('float16') * 1e-7
    row_normalized = mymatrix / np.sum(mymatrix, axis=1, keepdims=True)
    sums = row_normalized.sum(axis=1)
    len(sums[sums != 1])  # -> 108

One could use something like Kahan summation, but you would need to track the size of the error and truncate all numbers in mymatrix below that error. I'm not quite sure how to quickly implement such a thing in numpy without a loop.
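[For reference, Kahan (compensated) summation can be sketched in a few lines of plain Python. This is a generic illustration rather than anything in the poster's code, and the explicit loop is exactly the part that is awkward to vectorize in numpy:]

```python
import math

def kahan_sum(xs):
    """Compensated summation: carries the low-order bits lost at
    each addition in a separate correction term."""
    total = 0.0
    c = 0.0  # running compensation for lost low-order bits
    for x in xs:
        y = x - c
        t = total + y
        c = (t - total) - y  # the bits of y that didn't make it into t
        total = t
    return total

xs = [0.1] * 10
print(sum(xs))        # naive summation: 0.9999999999999999
print(kahan_sum(xs))  # compensated: at least as close to 1.0
print(math.fsum(xs))  # correctly rounded reference: 1.0
```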

On Sat, 2015-12-19 at 06:55 -0600, Andy Ray Terrel wrote:
In fact, the code even seems to do Kahan summation. However, I think it always assumes double precision for the p keyword argument, so as a workaround, at least, you have to convert the probabilities to double and normalize them as doubles.

- Sebastian
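[Following the observation above, a sketch of that workaround on stand-in data: keep the large matrix in float32, but promote just the row being sampled to float64 and renormalize it immediately before the call. The matrix here is made-up random data, not the poster's:]

```python
import numpy as np

rng = np.random.RandomState(0)
mymatrix = rng.rand(1024, 1024).astype(np.float32)  # stand-in for the real data
np.power(mymatrix, 10, out=mymatrix)

# Promote only the row being sampled to float64 and renormalize it, so
# its sum is within a few ULPs of 1.0 in the precision choice expects.
row = mymatrix[0].astype(np.float64)
row /= row.sum()

idx = np.random.choice(row.shape[0], 1, p=row)
```

This keeps the memory cost at one extra float64 row at a time, which sidesteps the RAM constraint that rules out storing the whole matrix in higher precision.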

participants (4)

- Andy Ray Terrel
- Nathaniel Smith
- Ryan R. Rosario
- Sebastian Berg