Re: [Numpy-discussion] making "low" optional in numpy.randint

Hello all, I have a PR open here <https://github.com/numpy/numpy/pull/7151> that makes "low" an optional parameter in numpy.randint and introduces new behavior into the API as follows:

1) `low == None` and `high == None`: numbers are generated over the range `[lowbnd, highbnd)`, where `lowbnd = np.iinfo(dtype).min` and `highbnd = np.iinfo(dtype).max`, and `dtype` is the provided integral type.

2) `low != None` and `high == None`: if `low >= 0`, numbers are *still* generated over the range `[0, low)`, but if `low < 0`, numbers are generated over the range `[low, highbnd)`, where `highbnd` is defined as above.

3) `low == None` and `high != None`: numbers are generated over the range `[lowbnd, high)`, where `lowbnd` is defined as above.

The primary motivation was the second case, as it is more convenient to specify a `dtype` by itself when generating such numbers, in a similar vein to numpy.empty, except with initialized values. Looking forward to your feedback! Greg
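For concreteness, a rough sketch of what calls under the proposed API would look like; the proposal was never merged, so these lines are illustrative only and mirror the three cases above rather than any released signature:

    import numpy as np

    # Case 1: both bounds omitted -> the dtype's own range
    np.random.randint(dtype=np.uint8)            # [0, 255) per the PR's definition
    # Case 2: only `low` given
    np.random.randint(10, dtype=np.uint8)        # low >= 0: still [0, 10)
    np.random.randint(-5, dtype=np.int8)         # low < 0:  [-5, 127)
    # Case 3: only `high` given
    np.random.randint(high=10, dtype=np.int8)    # [-128, 10)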

Behavior of random integer generation:

    Python   randint        [a,b]
    MATLAB   randi          [a,b]
    Mma      RandomInteger  [a,b]
    Haskell  randomR        [a,b]
    GAUSS    rndi           [a,b]
    Maple    rand           [a,b]

In short, NumPy's `randint` is non-standard (and, I would add, non-intuitive). Presumably this was due to relying on a float draw from [0,1) along with the use of floor. The divergence in behavior from the (later) Python function of the same name is particularly unfortunate. So I suggest further work on this function is not called for, and use of `random_integers` should be encouraged. Probably NumPy's `randint` should be deprecated. If there is any playing with the interface, I think Mma provides a pretty good model. If I were designing the interface, I would always require a tuple argument (for the inclusive range), with possible `None` values to imply datatype extreme values. Proposed name (after `randint` deprecation): `randints`. Cheers, Alan Isaac
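To make the divergence concrete, compare the standard-library and NumPy functions that share the name (both are real, current APIs):

    import random
    import numpy as np

    random.randint(1, 6)      # stdlib: inclusive upper bound, returns 1..6
    np.random.randint(1, 6)   # NumPy: exclusive upper bound, returns 1..5
    random.randrange(1, 6)    # stdlib half-open counterpart, returns 1..5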

On Wed, Feb 17, 2016 at 4:40 PM, Alan Isaac <alan.isaac@gmail.com> wrote:
No, it never was. It is implemented this way because Python prefers semi-open integer intervals, as they play most nicely with 0-based indexing. Not sure about all of those systems, but some at least use 1-based indexing, so closed intervals do make sense there. The Python stdlib's random.randint() closed interval is considered a mistake by python-dev, leading to the implementation of, and preference for, random.randrange() instead.
The divergence in behavior from the (later) Python function of the same name is particularly unfortunate.
Indeed, but unfortunately, this mistake dates way back to Numeric times, and easing the migration to numpy was a priority in the heady days of numpy 1.0.
Not while I'm here. Instead, `random_integers()` is discouraged and might eventually be deprecated. -- Robert Kern

Perhaps, but we are not coding in Haskell. We are coding in Python, and the standard is that the endpoint is excluded, which renders your point moot, I'm afraid. On Wed, Feb 17, 2016 at 5:10 PM, Alan Isaac <alan.isaac@gmail.com> wrote:

On 2/17/2016 12:28 PM, G Young wrote:
I am not sure what "standard" you are talking about. I thought we were talking about the user interface. Nobody is proposing changing the behavior of `range`. That is an entirely separate question. I'm not trying to change any minds, but let's not rely on spurious arguments. Cheers, Alan

On Wed, Feb 17, 2016 at 8:30 PM, Alan Isaac <alan.isaac@gmail.com> wrote:
It is a persistent and consistent convention (i.e. "standard") across Python APIs that deal with integer ranges (range(), slice(), random.randrange(), ...), particularly those that end up related to indexing; e.g. `x[np.random.randint(0, len(x))]` to pull a random sample from an array. random.randint() was the one big exception, and it was considered a mistake for that very reason, soft-deprecated in favor of random.randrange(). -- Robert Kern

On 2/17/2016 3:42 PM, Robert Kern wrote:
randrange also has its detractors: https://code.activestate.com/lists/python-dev/138358/ and following. I think if we start citing persistent conventions, the persistent convention across *many* languages that the bounds provided for a random integer range are inclusive also counts for something, especially when the names are essentially shared. But again, I am just trying to be clear about what is at issue, not push for a change. I think citing non-existent standards is not helpful. I think the discrepancy between the Python standard library and numpy for a function going by a common name is harmful. (But then, I teach.) fwiw, Alan

Also fwiw, I think the 0-based, half-open interval is one of the best features of Python indexing and yes, I do use random integers to index into my arrays and would not appreciate having to litter my code with "-1" everywhere. On Thu, Feb 18, 2016 at 10:29 AM, Alan Isaac <alan.isaac@gmail.com> wrote:

Your statement is a little self-contradictory, but in any case, you shouldn't worry about random_integers getting removed from the code-base. However, it has been deprecated in favor of randint. On Wed, Feb 17, 2016 at 11:48 PM, Juan Nunez-Iglesias <jni.soma@gmail.com> wrote:

On 2/17/2016 6:48 PM, Juan Nunez-Iglesias wrote:
http://docs.scipy.org/doc/numpy-1.10.0/reference/generated/numpy.random.choice.html fwiw, Alan Isaac

On 2/17/2016 7:01 PM, Juan Nunez-Iglesias wrote:
Notice the limitation "1D array-like".
http://docs.scipy.org/doc/numpy-1.10.0/reference/generated/numpy.random.choi... "If an int, the random sample is generated as if a was np.arange(n)" hth, Alan Isaac
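A minimal illustration of the docstring line quoted above (these calls reflect the actual numpy.random.choice behavior):

    import numpy as np

    np.random.choice(5)                # int: sample from np.arange(5), i.e. 0..4
    np.random.choice(np.arange(5))     # equivalent: sample from the given 1D array
    np.random.choice(['a', 'b', 'c'])  # any 1D array-like works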

On Wed, Feb 17, 2016 at 7:17 PM, Juan Nunez-Iglesias <jni.soma@gmail.com> wrote:
(un)related aside: my R doc quote about "may lead to undesired behaviour" refers to this. IIRC, R's `sample` was the inspiration for this function, but numpy distinguishes a scalar from a one-element (1D) array:
for i in range(3, 10):
    np.random.choice(np.arange(10)[i:])
Josef
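To spell out the distinction Josef is pointing at (this is the actual numpy.random.choice behavior, in contrast to R's `sample`, which treats a length-one numeric vector as an upper bound):

    import numpy as np

    np.random.choice(9)               # int argument: samples uniformly from 0..8
    np.random.choice(np.array([9]))   # one-element array: always returns 9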

Joe: fair enough. A separate function seems more reasonable. Perhaps it was a wording thing, but you kept saying "wrapper," which is not the same as a separate function. Josef: I don't think we are making people think more. They're all keyword arguments, so if you don't want to think about them, then you leave them as the defaults, and everyone is happy. The 'dtype' keyword was needed by someone who wanted to generate a large array of uint8 random integers and could not simply call 'astype' due to memory constraints. I would suggest you read this issue here <https://github.com/numpy/numpy/issues/6790> and the PRs that followed so that you have a better understanding as to why this 'weird' behavior was chosen. On Wed, Feb 17, 2016 at 8:30 PM, Alan Isaac <alan.isaac@gmail.com> wrote:
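For context on that memory constraint, a rough illustration of why drawing bytes directly matters (the dtype keyword itself is real, added to numpy.random.randint in 1.11; the array size here is just for illustration):

    import numpy as np

    n = 10**8
    # Drawn directly as uint8: a single ~100 MB array.
    a = np.random.randint(0, 256, size=n, dtype=np.uint8)
    # Drawn as the default integer type and then cast: the intermediate
    # array alone is ~800 MB on platforms with a 64-bit default integer.
    b = np.random.randint(0, 256, size=n).astype(np.uint8)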

On Wed, Feb 17, 2016 at 8:43 PM, G Young <gfyoung17@gmail.com> wrote:
Josef: I don't think we are making people think more. They're all keyword arguments, so if you don't want to think about them, then you leave them as the defaults, and everyone is happy.
I believe that Josef has the code's reader in mind, not the code's writer. As a reader of other people's code (and I count 6-months-ago-me as one such "other people"), I am sure to eventually encounter all of the different variants, so I will need to know all of them. -- Robert Kern

I sense that this issue is now becoming more of a "randint has become too complicated" discussion. I suppose we could always "add" more functions that present simpler interfaces, though if you really do want simple, there's always Python's random library you can use. On Wed, Feb 17, 2016 at 8:48 PM, Robert Kern <robert.kern@gmail.com> wrote:

On Wed, Feb 17, 2016 at 3:58 PM, G Young <gfyoung17@gmail.com> wrote:
I have mostly the users in mind (i.e. me). I like simple patterns where I don't have to stare at a docstring for five minutes to understand it, or pull it up again each time I use it. dtype for storage is different from dtype as a distribution parameter. --- Aside, since I just read this: https://news.ycombinator.com/item?id=11112763 -- an example of what to avoid. You save a few keystrokes and spend months trying to figure out what's going on (exaggerated). "*Note* that this convenience feature may lead to undesired behaviour when ..." from the R docs. Josef

On Mi, 2016-02-17 at 20:48 +0000, Robert Kern wrote:
Completely agree. Greg, if you need more than a few minutes to explain it, in this case there seems little point. It seems to me even the worst cases of your examples would be covered by writing code like: np.random.randint(np.iinfo(np.uint8).min, 10, dtype=np.uint8) And *everyone* will immediately know what is meant, with just minor extra effort for writing it. We should keep the analogy to "range" as much as possible; anything going far beyond that can be confusing. On first sight I am not convinced that there is a serious convenience gain by doing magic here, and this is a simple case of "explicit is better than implicit", since writing the explicit code is easy. It might also create weird bugs if the completely unexpected happens (most users would probably not even realize the feature existed) and you get huge numbers because you happened to have a `low=0` in there. Especially your point 2) seems confusing. As for 3), if I see `np.random.randint(high=3)` I think I would assume [0, 3).... Additionally, I am not sure the maximum int range is such a common need anyway? - Sebastian
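For reference, the explicit spellings Sebastian has in mind work with the current randint signature (dtype support landed in 1.11); a couple of sketches:

    import numpy as np

    # Full range of uint8, spelled out explicitly (high is exclusive):
    np.random.randint(0, 256, size=10, dtype=np.uint8)
    # Sebastian's example: the dtype's lower bound up to (but excluding) 10:
    np.random.randint(np.iinfo(np.uint8).min, 10, size=10, dtype=np.uint8)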

On Mi, 2016-02-17 at 22:10 +0100, Sebastian Berg wrote:
OK, that was silly; that is what happens, of course. So it is explicit in the sense that you have to pass in at least one `None` explicitly. But I am still not sure that the added convenience is big and easy to understand [1]. If it were always the lowest value for low and the highest for high, I could remember it, but it seems more complex than that (though None does also look a bit like "default", and the "default" is 0 for low). - Sebastian [1] As in the trade-off between added complexity vs. added convenience.

On Mi, 2016-02-17 at 21:53 +0000, G Young wrote:
"Explicit is better than implicit" - can't argue with that. It doesn't seem like the PR has gained much traction, so I'll close it.
Thanks for the effort though! Sometimes we get a bit carried away with doing fancy stuff, and I guess the idea is likely a bit too fancy for wide application. - Sebastian

On Wed, Feb 17, 2016 at 10:01 AM, G Young <gfyoung17@gmail.com> wrote:
My impression (*) is that this will be confusing, and uses a default that I never ever needed. Maybe a better way would be to use low=-np.inf and high=np.inf where inf would be interpreted as the smallest and largest representable number. And leave the defaults unchanged. (*) I didn't try to understand how it works for various cases. Josef

On Wed, Feb 17, 2016 at 1:37 PM, <josef.pktd@gmail.com> wrote:
As I mentioned on the PR discussion, the thing that bothers me is the inconsistency between the new and the old functionality, specifically in #2. If `high` is None, the behavior is completely different depending on the value of `low`. Using `np.inf` instead of `None` may fix that, although I think that the author's idea was to avoid having to type the bounds in the `None`/`+/-np.inf` cases. I think that a better option is to have a separate wrapper to `randint` that implements this behavior in a consistent manner and leaves the current function consistent as well. -Joe

Yes, you are correct in explaining my intentions. However, as I also mentioned in the PR discussion, I did not quite understand how your wrapper idea would make things any more comprehensible, given the additional overhead and complexity. What do you mean by making the functions "consistent" (i.e. please outline the behavior *exactly* depending on the inputs)? As I've explained before, and I will state it again, the different behavior for the high=None and low != None case is due to backwards compatibility. On Wed, Feb 17, 2016 at 6:52 PM, Joseph Fox-Rabinovitz <jfoxrabinovitz@gmail.com> wrote:

My point is that you are proposing to make the overall API have counter-intuitive behavior for the sake of adding a new feature. It is worth a little bit of overhead to have two functions that behave exactly as expected. Josef's footnote is a good example of how people will feel about having to figure out (not to mention remember) the different use cases. I think it is better to keep the current API and just add a "bounded_randint" function for which an input of `None` always means "limit of that bound, no exceptions". -Joe On Wed, Feb 17, 2016 at 2:09 PM, G Young <gfyoung17@gmail.com> wrote:
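A minimal sketch of the kind of separate function being described here; the name bounded_randint comes from the message above, but the body is purely hypothetical and was never part of NumPy:

    import numpy as np

    def bounded_randint(low=None, high=None, size=None, dtype=np.int64):
        """Hypothetical helper: None always means the dtype's own bound."""
        info = np.iinfo(dtype)
        lo = info.min if low is None else low
        # high stays exclusive to match np.random.randint; note info.max + 1
        # may exceed what randint accepts for the very widest dtypes.
        hi = info.max + 1 if high is None else high
        return np.random.randint(lo, hi, size=size, dtype=dtype)

    bounded_randint(high=10, dtype=np.int8)   # draws from [-128, 10)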

On Wed, Feb 17, 2016 at 2:09 PM, G Young <gfyoung17@gmail.com> wrote:
One problem is that if there is only one positional argument, then I can still figure out that it might have different meanings. If there are two keywords, then I would assume standard python argument interpretation applies. If I want to save on typing, then I think it should be for a more "standard" case. (I also never sample all real numbers, at least not uniformly.) Josef

On Wed, Feb 17, 2016 at 2:20 PM, <josef.pktd@gmail.com> wrote:
One more thing I don't like: so far all distributions are "theoretical" distributions where the distribution depends on the provided shape, location and scale parameters. There is a limitation in how they are represented as numbers/dtype and what range is possible, but that is not relevant for most use cases. In this case you are promoting `dtype` from a memory or storage parameter to an actual shape (or loc and scale) parameter. That's "weird", and even more so if this were the default behavior. There is no proper uniform distribution on all integers. So this forces users to think about an implementation detail like dtype when I just want a random sample of a probability distribution. Josef
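To restate that storage-vs-parameter distinction in code (the first two calls are real, current APIs; the commented-out call is the behavior the PR proposed):

    import numpy as np

    # dtype as storage: the distribution is uniform on [0, 10) either way;
    # the dtype only affects how the result is stored.
    np.random.randint(0, 10, size=5, dtype=np.uint8)
    np.random.randint(0, 10, size=5).astype(np.uint8)

    # dtype as a distribution parameter (the proposal): the dtype itself
    # would determine the support, here [0, 255) per the PR's case 1.
    # np.random.randint(dtype=np.uint8)   # hypothetical, not in NumPy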

participants (7):
- Alan Isaac
- G Young
- josef.pktd@gmail.com
- Joseph Fox-Rabinovitz
- Juan Nunez-Iglesias
- Robert Kern
- Sebastian Berg