From ralf.gommers at gmail.com Fri Jun 1 00:57:06 2018 From: ralf.gommers at gmail.com (Ralf Gommers) Date: Thu, 31 May 2018 21:57:06 -0700 Subject: [Numpy-discussion] A roadmap for NumPy - longer term planning In-Reply-To: <69cf3275-26f3-deec-d499-b56204d96c60@gmail.com> References: <69cf3275-26f3-deec-d499-b56204d96c60@gmail.com> Message-ID: On Thu, May 31, 2018 at 4:50 PM, Matti Picus wrote: > At the recent NumPy sprint at BIDS (thanks to those who made the trip) we > spent some time brainstorming about a roadmap for NumPy, in the spirit of > similar work that was done for Jupyter. The idea is that a document with > wide community acceptance can guide the work of the full-time developer(s), > and be a source of ideas for expanding development efforts. > > I put the document up at https://github.com/numpy/numpy/wiki/NumPy-Roadmap, > and hope to discuss it at a BOF session during SciPy in the middle of July > in Austin. > Thanks for writing that up! > > Eventually it could become a NEP or formalized in another way. > A NEP doesn't sound quite right, but moving from wiki to somewhere more formal and with more control over the contents (e.g. numpy.org or in the docs) would be useful. A roadmap could/should also include things like required effort, funding and knowledge/people required. A couple of comments on the content: - a mention of stability or backwards compatibility goals under philosophy would be useful - the "Could potentially be split out into separate packages..." should be removed I think - the maskedarray one was already rejected, and the rest are similarly unhelpful. - "internal refactorings": MaskedArray yes, but the other ones no. numpy.distutils and f2py are very hard to test, a big refactor pretty much guarantees breakage. there's also not much need for refactoring, because those things are not coupled to the numpy.core internals. numpy.financial is simply uninteresting - we wish it wasn't there but it is, so now it simply stays where it is. - One item that I think is missing under "New functionality" is runtime switching of backend for numpy.linalg (IIRC discussed on this list before) and numpy.random (MKL devs are interested in this). Cheers, Ralf -------------- next part -------------- An HTML attachment was scrubbed... URL: From m.h.vankerkwijk at gmail.com Fri Jun 1 07:43:32 2018 From: m.h.vankerkwijk at gmail.com (Marten van Kerkwijk) Date: Fri, 1 Jun 2018 07:43:32 -0400 Subject: [Numpy-discussion] A roadmap for NumPy - longer term planning In-Reply-To: References: <69cf3275-26f3-deec-d499-b56204d96c60@gmail.com> Message-ID: Hi Matti, Thanks for sharing the roadmap. Overall, it looks very nice. A practical question is on whether you want input via the mailing list, or should one just edit the wiki and add questions or so? As the roadmap mentioned interaction with python proper (and a possible PEP): one thing that always slightly annoyed me is that numpy math is way slower for scalars than python math - and duplicates all the function names. It would seem to make sense to allow python's math module to be overridden for non-python input, including arrays. That could be another PEP... All the best, Marten -------------- next part -------------- An HTML attachment was scrubbed... 
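As a rough, machine-dependent check of the scalar overhead Marten describes above, one can compare Python's math.sin with np.sin on a single float (illustrative sketch only; exact numbers will vary):

import math
import timeit

import numpy as np

# Rough check of scalar-math overhead: np.sin on a Python float goes
# through ufunc dispatch and returns a NumPy scalar, so it is typically
# many times slower than math.sin for a single value.
x = 0.5
n = 100000
t_math = timeit.timeit(lambda: math.sin(x), number=n)
t_numpy = timeit.timeit(lambda: np.sin(x), number=n)
print("math.sin: {:.4f} s for {} calls".format(t_math, n))
print("np.sin:   {:.4f} s for {} calls".format(t_numpy, n))
print("ratio:    {:.1f}x".format(t_numpy / t_math))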
URL: From nussbaum at uni-mainz.de Fri Jun 1 08:24:48 2018 From: nussbaum at uni-mainz.de (=?UTF-8?Q?Andreas_Nu=C3=9Fbaumer?=) Date: Fri, 1 Jun 2018 14:24:48 +0200 Subject: [Numpy-discussion] Change in default behavior of np.polyfit Message-ID: Hi, in [1] the scaling factor for the covariance matrix of `np.polyfit` was discussed. The conclusion was, that it is non-standard and a patch might be in order to correct this. Pull request [2] changes the factor from chisq(popt)/(M-N-2) to chisq(popt)/(M-N) (with M=number of point, N=number of parameters) essentially removing the "-2". Clearly, this changes the result for the covariance matrix (but not the result for the polynomial coefficients) and therefore the current behavior if `cov=True` is set. It should be noted, that `scipy.optimize.curve_fit` also uses the chisq(popt)/(M-N) as scaling factor (without "-2"). Therefore, the change would remove a discrepancy. Additionally, patch [2] adds an option that sets the scaling factor of the covariance matrix to 1 . This can be useful in occasions, where the weights are given by 1/sigma with sigma being the (known) standard errors of (Gaussian distributed) data points, in which case the un-scaled matrix is already a correct estimate for the covariance matrix. Best, Andreas [1] http://numpy-discussion.10968.n7.nabble.com/Inconsistent-results-for-the-covariance-matrix-between-scipy-optimize-curve-fit-and-numpy-polyfit-td45582.html [2] https://github.com/numpy/numpy/pull/11197 -------------- next part -------------- An HTML attachment was scrubbed... URL: From m.h.vankerkwijk at gmail.com Fri Jun 1 08:29:39 2018 From: m.h.vankerkwijk at gmail.com (Marten van Kerkwijk) Date: Fri, 1 Jun 2018 08:29:39 -0400 Subject: [Numpy-discussion] Change in default behavior of np.polyfit In-Reply-To: References: Message-ID: Hi Andreas, Thanks for noticing and correcting this unexpected scaling! The addition to get the unscaled version is also very welcome. All the best, Marten ? -------------- next part -------------- An HTML attachment was scrubbed... URL: From toddrjen at gmail.com Fri Jun 1 11:27:52 2018 From: toddrjen at gmail.com (Todd) Date: Fri, 1 Jun 2018 11:27:52 -0400 Subject: [Numpy-discussion] A roadmap for NumPy - longer term planning In-Reply-To: <69cf3275-26f3-deec-d499-b56204d96c60@gmail.com> References: <69cf3275-26f3-deec-d499-b56204d96c60@gmail.com> Message-ID: On Thu, May 31, 2018, 19:50 Matti Picus wrote: > At the recent NumPy sprint at BIDS (thanks to those who made the trip) > we spent some time brainstorming about a roadmap for NumPy, in the > spirit of similar work that was done for Jupyter. The idea is that a > document with wide community acceptance can guide the work of the > full-time developer(s), and be a source of ideas for expanding > development efforts. > > I put the document up at > https://github.com/numpy/numpy/wiki/NumPy-Roadmap, and hope to discuss > it at a BOF session during SciPy in the middle of July in Austin. > > Eventually it could become a NEP or formalized in another way. > > Matti > Some things I have seen mentioned but don't know the current plans for: * Categorical arrays * Releasing the GIL wherever possible * Using multithreading internally * making use of the next generation blas when available and stay involved in planning to make sure it supports our needs * Figure out where to use Cython and were not to > -------------- next part -------------- An HTML attachment was scrubbed... 
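Returning to the np.polyfit covariance thread above: to make the competing scaling factors concrete, here is a small sketch. The three variants are computed by hand from the weighted design matrix, because which factor np.polyfit(..., cov=True) itself applies depends on whether the change in the linked pull request is in your NumPy version.

import numpy as np

# Sketch of the three covariance scalings discussed in the np.polyfit
# thread above, computed by hand from the weighted design matrix.
rng = np.random.RandomState(0)
M, deg = 50, 1
N = deg + 1                                  # number of fitted parameters
sigma = 0.3                                  # known per-point standard error
x = np.linspace(0.0, 10.0, M)
y = 2.0 * x + 1.0 + rng.normal(scale=sigma, size=M)
w = np.ones(M) / sigma                       # polyfit-style weights, w = 1/sigma

coef = np.polyfit(x, y, deg, w=w)
chi2 = np.sum((w * (y - np.polyval(coef, x))) ** 2)

A = np.vander(x, N) * w[:, np.newaxis]        # weighted Vandermonde matrix
cov_unscaled = np.linalg.inv(np.dot(A.T, A))  # already correct when w = 1/sigma

cov_old = cov_unscaled * chi2 / (M - N - 2)   # historical polyfit factor
cov_new = cov_unscaled * chi2 / (M - N)       # factor proposed in the PR

for name, c in [("unscaled", cov_unscaled), ("old", cov_old), ("new", cov_new)]:
    print(name, np.sqrt(np.diag(c)))          # 1-sigma parameter uncertainties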
URL: From toddrjen at gmail.com Fri Jun 1 11:48:32 2018 From: toddrjen at gmail.com (Todd) Date: Fri, 1 Jun 2018 11:48:32 -0400 Subject: [Numpy-discussion] A roadmap for NumPy - longer term planning In-Reply-To: References: <69cf3275-26f3-deec-d499-b56204d96c60@gmail.com> Message-ID: On Fri, Jun 1, 2018, 11:27 Todd wrote: > > > On Thu, May 31, 2018, 19:50 Matti Picus wrote: > >> At the recent NumPy sprint at BIDS (thanks to those who made the trip) >> we spent some time brainstorming about a roadmap for NumPy, in the >> spirit of similar work that was done for Jupyter. The idea is that a >> document with wide community acceptance can guide the work of the >> full-time developer(s), and be a source of ideas for expanding >> development efforts. >> >> I put the document up at >> https://github.com/numpy/numpy/wiki/NumPy-Roadmap, and hope to discuss >> it at a BOF session during SciPy in the middle of July in Austin. >> >> Eventually it could become a NEP or formalized in another way. >> >> Matti >> > > > Some things I have seen mentioned but don't know the current plans for: > > * Categorical arrays > * Releasing the GIL wherever possible > * Using multithreading internally > * making use of the next generation blas when available and stay involved > in planning to make sure it supports our needs > * Figure out where to use Cython and were not to > Also: * Figure out the best way to handle strings. This may involve multiple approaches for different situations but the current approach may not be the best default approach. * Decimal and/or rational arrays * if yes to labeled arrays, then there should probably be a pep about label-based indexing * A decision about how to handle numpy 2.0 > -------------- next part -------------- An HTML attachment was scrubbed... URL: From chris.barker at noaa.gov Fri Jun 1 12:46:57 2018 From: chris.barker at noaa.gov (Chris Barker) Date: Fri, 1 Jun 2018 09:46:57 -0700 Subject: [Numpy-discussion] A roadmap for NumPy - longer term planning In-Reply-To: References: <69cf3275-26f3-deec-d499-b56204d96c60@gmail.com> Message-ID: On Fri, Jun 1, 2018 at 4:43 AM, Marten van Kerkwijk < m.h.vankerkwijk at gmail.com> wrote: > one thing that always slightly annoyed me is that numpy math is way > slower for scalars than python math > numpy is also quite a bit slower than raw python for math with (very) small arrays: In [31]: % timeit t2 = (t[0] * 10, t[1] * 10) 162 ns ? 0.79 ns per loop (mean ? std. dev. of 7 runs, 10000000 loops each) In [32]: a Out[32]: array([ 3.4, 5.6]) In [33]: % timeit a2 = a * 10 941 ns ? 7.95 ns per loop (mean ? std. dev. of 7 runs, 1000000 loops each) (I often want to so this sort of thing, not for performance, but for ease of computation -- say you have 2 or three coordinates that represent a point -- it's really nice to be able to scale or shift with array operations, rather than all that indexing -- but it is pretty slo with numpy. I've wondered if numpy could be optimized for small 1D arrays, and maybe even 2d arrays with a small fixed second dimension (N x 2, N x 3), by special-casing / short-cutting those cases. It would require some careful profiling to see if it would help, but it sure seems possible. And maybe scalars could be fit into the same system. -CHB -- Christopher Barker, Ph.D. 
Oceanographer Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker at noaa.gov -------------- next part -------------- An HTML attachment was scrubbed... URL: From stefanv at berkeley.edu Fri Jun 1 12:57:14 2018 From: stefanv at berkeley.edu (Stefan van der Walt) Date: Fri, 1 Jun 2018 09:57:14 -0700 Subject: [Numpy-discussion] A roadmap for NumPy - longer term planning In-Reply-To: References: <69cf3275-26f3-deec-d499-b56204d96c60@gmail.com> Message-ID: <20180601165714.26ykmtlmi3p75iog@carbo> Hi Ralf, On Thu, 31 May 2018 21:57:06 -0700, Ralf Gommers wrote: > - "internal refactorings": MaskedArray yes, but the other ones no. > numpy.distutils and f2py are very hard to test, a big refactor pretty much > guarantees breakage. there's also not much need for refactoring, because > those things are not coupled to the numpy.core internals. numpy.financial > is simply uninteresting - we wish it wasn't there but it is, so now it > simply stays where it is. I want to clarify that in the current notes we put down ideas that prompted active discussion, even if they weren't necessarily feasible. I feel it is important to keep the conversation open to run its course until we have a good understanding of the various issues at hand. You may find that, in person, people are more willing to admit to their support for some "heretical" ideas than they are here on the list. E.g., you say that the financial functions "now simply stay", but that promises a future of a NumPy that never shrinks, while there is certainly some support for allowing NumPy to contract so that we can release maintenance burden and allow development of other core areas that have been neglected for a long time. You will *always* have small, vocal proponents of any specific piece of functionality; that doesn't necessarily mean that such functionality contributes to the health of a project as a whole. So, I gently urge us carefully reconsider the narrative that nothing can change/be removed, and evaluate each suggestion carefully, not weighing only the very evident negatives but also the longer term positives. Best regards, St?fan From chris.barker at noaa.gov Fri Jun 1 13:06:48 2018 From: chris.barker at noaa.gov (Chris Barker) Date: Fri, 1 Jun 2018 10:06:48 -0700 Subject: [Numpy-discussion] A roadmap for NumPy - longer term planning In-Reply-To: References: <69cf3275-26f3-deec-d499-b56204d96c60@gmail.com> Message-ID: On Fri, Jun 1, 2018 at 9:46 AM, Chris Barker wrote: > numpy is also quite a bit slower than raw python for math with (very) > small arrays: > doing a bit more experimentation, the advantage is with pure python for over 10 elements (I got bored...). but I noticed that the time for numpy computation is pretty much constant for 2 up to around 100 elements. Which implies that the bulk of the issue is with "startup" costs, rather than fancy indexing or anything like that. so maybe a short cut wouldn't be helpful. Note if you use a list comp (the pythonic translation of an array operation) thecrossover point is about 15 elements (in my tests, on my machine...) In [90]: % timeit t2 = [x * 10 for x in t] 920 ns ? 4.88 ns per loop (mean ? std. dev. of 7 runs, 1000000 loops each) -CHB > In [31]: % timeit t2 = (t[0] * 10, t[1] * 10) > 162 ns ? 0.79 ns per loop (mean ? std. dev. of 7 runs, 10000000 loops each) > > In [32]: a > Out[32]: array([ 3.4, 5.6]) > > In [33]: % timeit a2 = a * 10 > 941 ns ? 7.95 ns per loop (mean ? std. 
dev. of 7 runs, 1000000 loops each) > > > (I often want to so this sort of thing, not for performance, but for ease > of computation -- say you have 2 or three coordinates that represent a > point -- it's really nice to be able to scale or shift with array > operations, rather than all that indexing -- but it is pretty slo with > numpy. > > I've wondered if numpy could be optimized for small 1D arrays, and maybe > even 2d arrays with a small fixed second dimension (N x 2, N x 3), by > special-casing / short-cutting those cases. > > It would require some careful profiling to see if it would help, but it > sure seems possible. > > And maybe scalars could be fit into the same system. > > -CHB > > > > -- > > Christopher Barker, Ph.D. > Oceanographer > > Emergency Response Division > NOAA/NOS/OR&R (206) 526-6959 voice > 7600 Sand Point Way NE (206) 526-6329 fax > Seattle, WA 98115 (206) 526-6317 main reception > > Chris.Barker at noaa.gov > -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker at noaa.gov -------------- next part -------------- An HTML attachment was scrubbed... URL: From harrigan.matthew at gmail.com Fri Jun 1 13:11:54 2018 From: harrigan.matthew at gmail.com (Matthew Harrigan) Date: Fri, 1 Jun 2018 13:11:54 -0400 Subject: [Numpy-discussion] A roadmap for NumPy - longer term planning In-Reply-To: <20180601165714.26ykmtlmi3p75iog@carbo> References: <69cf3275-26f3-deec-d499-b56204d96c60@gmail.com> <20180601165714.26ykmtlmi3p75iog@carbo> Message-ID: I would love to see gufuncs become more general. Specifically I would like an optional prologue and epilogue function. The prologue could potentially 1) inspect parameterized dtypes 2) kwargs 3) set non-trivial output array sizes 4) initialize data structures 5) defer processing to other functions (BLAS). The epilogue function could do any clean up of data structures. On Fri, Jun 1, 2018 at 12:57 PM, Stefan van der Walt wrote: > Hi Ralf, > > On Thu, 31 May 2018 21:57:06 -0700, Ralf Gommers wrote: > > - "internal refactorings": MaskedArray yes, but the other ones no. > > numpy.distutils and f2py are very hard to test, a big refactor pretty > much > > guarantees breakage. there's also not much need for refactoring, because > > those things are not coupled to the numpy.core internals. numpy.financial > > is simply uninteresting - we wish it wasn't there but it is, so now it > > simply stays where it is. > > I want to clarify that in the current notes we put down ideas that > prompted active discussion, even if they weren't necessarily feasible. > I feel it is important to keep the conversation open to run its course > until we have a good understanding of the various issues at hand. > > You may find that, in person, people are more willing to admit to their > support for some "heretical" ideas than they are here on the list. > > E.g., you say that the financial functions "now simply stay", but that > promises a future of a NumPy that never shrinks, while there is > certainly some support for allowing NumPy to contract so that we can > release maintenance burden and allow development of other core areas > that have been neglected for a long time. > > You will *always* have small, vocal proponents of any specific piece of > functionality; that doesn't necessarily mean that such functionality > contributes to the health of a project as a whole. 
> > So, I gently urge us carefully reconsider the narrative that nothing can > change/be removed, and evaluate each suggestion carefully, not weighing > only the very evident negatives but also the longer term positives. > > Best regards, > St?fan > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at python.org > https://mail.python.org/mailman/listinfo/numpy-discussion > -------------- next part -------------- An HTML attachment was scrubbed... URL: From harrigan.matthew at gmail.com Fri Jun 1 13:19:00 2018 From: harrigan.matthew at gmail.com (Matthew Harrigan) Date: Fri, 1 Jun 2018 13:19:00 -0400 Subject: [Numpy-discussion] Allowing broadcasting of code dimensions in generalized ufuncs In-Reply-To: References: Message-ID: Stephan, good point about use cases. I think its still an odd fit. For example I think np.array_equal(np.zeros((3,3)), np.zeros((2,2))) or np.array_equal([1], ['foo']) would be difficult or impossible to replicate with a potential all_equal gufunc On Thu, May 31, 2018 at 2:00 PM, Stephan Hoyer wrote: > On Wed, May 30, 2018 at 5:01 PM Matthew Harrigan < > harrigan.matthew at gmail.com> wrote: > >> "short-cut to automatically return False if m != n", that seems like a >> silent bug >> > > I guess it depends on the use-cases. This is how np.array_equal() works: > https://docs.scipy.org/doc/numpy/reference/generated/ > numpy.array_equal.html > > We could even imagine incorporating this hypothetical "equality along some > axes with broadcasting" functionality into axis/axes arguments for > array_equal() if we choose this behavior. > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at python.org > https://mail.python.org/mailman/listinfo/numpy-discussion > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From gael.varoquaux at normalesup.org Fri Jun 1 13:19:41 2018 From: gael.varoquaux at normalesup.org (Gael Varoquaux) Date: Fri, 1 Jun 2018 19:19:41 +0200 Subject: [Numpy-discussion] A roadmap for NumPy - longer term planning In-Reply-To: References: <69cf3275-26f3-deec-d499-b56204d96c60@gmail.com> <20180601165714.26ykmtlmi3p75iog@carbo> Message-ID: <20180601171941.sllo2kikmqnzk2d3@phare.normalesup.org> While we are in the crazy wish-list: having dtypes that are universal enough for pandas to use them and export their columns with them would be my crazy wish. I hope that it would help adding more uniform support for things like categorical variables in the pydata ecosystem. Ga?l From ralf.gommers at gmail.com Fri Jun 1 15:11:17 2018 From: ralf.gommers at gmail.com (Ralf Gommers) Date: Fri, 1 Jun 2018 12:11:17 -0700 Subject: [Numpy-discussion] A roadmap for NumPy - longer term planning In-Reply-To: <20180601165714.26ykmtlmi3p75iog@carbo> References: <69cf3275-26f3-deec-d499-b56204d96c60@gmail.com> <20180601165714.26ykmtlmi3p75iog@carbo> Message-ID: On Fri, Jun 1, 2018 at 9:57 AM, Stefan van der Walt wrote: > Hi Ralf, > > On Thu, 31 May 2018 21:57:06 -0700, Ralf Gommers wrote: > > - "internal refactorings": MaskedArray yes, but the other ones no. > > numpy.distutils and f2py are very hard to test, a big refactor pretty > much > > guarantees breakage. there's also not much need for refactoring, because > > those things are not coupled to the numpy.core internals. numpy.financial > > is simply uninteresting - we wish it wasn't there but it is, so now it > > simply stays where it is. 
> > I want to clarify that in the current notes we put down ideas that > prompted active discussion, even if they weren't necessarily feasible. > I feel it is important to keep the conversation open to run its course > until we have a good understanding of the various issues at hand. > > You may find that, in person, people are more willing to admit to their > support for some "heretical" ideas than they are here on the list. > Thanks Stefan, good points. I totally agree that anything can be discussed. > > E.g., you say that the financial functions "now simply stay", but that > promises a future of a NumPy that never shrinks, while there is > certainly some support for allowing NumPy to contract so that we can > release maintenance burden and allow development of other core areas > that have been neglected for a long time. > > You will *always* have small, vocal proponents of any specific piece of > functionality; that doesn't necessarily mean that such functionality > contributes to the health of a project as a whole. > > So, I gently urge us carefully reconsider the narrative that nothing can > change/be removed, and evaluate each suggestion carefully, not weighing > only the very evident negatives but also the longer term positives. > I don't think there's such a narrative - e.g. the removal of np.matrix that we've planned and getting rid of MaskedArray at some point once we have a better new masked array implementation are *major* removals. We do plan those things because they have major benefits. Imho "major benefits" is a bar that needs to be passed before listing features as up for removal on a roadmap (even a draft one). It would be helpful maybe to find a form for the roadmap where the essentials of such discussions (key pros/cons) can be captured. Or at least split it in good/desirable/planned items and "wild ideas". Re `financial`, there isn't much of a pro as far as I can tell - there's almost zero maintenance cost now, and it doesn't hinder any of the proposed new features. Plus it's a discussion we've had a couple of times before. I know that the current roadmap doc is only draft, but it still says "NumPy Roadmap" and it's the best thing we have now, so I'd prefer to not have things there (or have them in a separate random/controversial ideas section) that are unlikely to happen or for which it's unclear if they're good ideas. Cheers, Ralf -------------- next part -------------- An HTML attachment was scrubbed... URL: From charlesr.harris at gmail.com Fri Jun 1 16:17:12 2018 From: charlesr.harris at gmail.com (Charles R Harris) Date: Fri, 1 Jun 2018 14:17:12 -0600 Subject: [Numpy-discussion] Python 3 compatible examples Message-ID: Hi All, This post is prompted by this PR . It would be good to come up with a timeline and plan for rewriting the examples to be Python 3 compatible. When we do so, we should also make it assumed that `from __future__ import print_function` has been executed when the examples are executed in Python 2.7. Might want to include `division` in that future import as well. Anyway, wanted to raise the subject. Thoughts? Chuck -------------- next part -------------- An HTML attachment was scrubbed... URL: From pav at iki.fi Fri Jun 1 16:22:20 2018 From: pav at iki.fi (Pauli Virtanen) Date: Fri, 01 Jun 2018 22:22:20 +0200 Subject: [Numpy-discussion] Python 3 compatible examples In-Reply-To: References: Message-ID: pe, 2018-06-01 kello 14:17 -0600, Charles R Harris kirjoitti: > This post is prompted by this PR /11222>. 
> It would be good to come up with a timeline and plan for rewriting > the > examples to be Python 3 compatible. When we do so, we should also > make it > assumed that `from __future__ import print_function` has been > executed when > the examples are executed in Python 2.7. Might want to include > `division` > in that future import as well. > > Anyway, wanted to raise the subject. Thoughts? For Scipy, we converted the examples in the documentation to Python 3, and have essentially ignored Python 2 compatibility. So far, I remember no complaints about it. Pauli From jni.soma at gmail.com Fri Jun 1 16:43:19 2018 From: jni.soma at gmail.com (Juan Nunez-Iglesias) Date: Sat, 02 Jun 2018 06:43:19 +1000 Subject: [Numpy-discussion] Python 3 compatible examples In-Reply-To: References: Message-ID: <1527885799.4058856.1393493968.7F81BC66@webmail.messagingengine.com> On Sat, Jun 2, 2018, at 6:22 AM, Pauli Virtanen wrote: > For Scipy, we converted the examples in the documentation to Python 3, > and have essentially ignored Python 2 compatibility. So far, I remember > no complaints about it. I vote for what Pauli said. From m.h.vankerkwijk at gmail.com Fri Jun 1 17:21:32 2018 From: m.h.vankerkwijk at gmail.com (Marten van Kerkwijk) Date: Fri, 1 Jun 2018 17:21:32 -0400 Subject: [Numpy-discussion] Python 3 compatible examples In-Reply-To: <1527885799.4058856.1393493968.7F81BC66@webmail.messagingengine.com> References: <1527885799.4058856.1393493968.7F81BC66@webmail.messagingengine.com> Message-ID: Agreed, good to get started and stop worrying about python2 in the examples at least. ?If someone cuts&pastes and it fails, it is just a good reminder to get moving... -- Marten -------------- next part -------------- An HTML attachment was scrubbed... URL: From millman at berkeley.edu Fri Jun 1 17:29:13 2018 From: millman at berkeley.edu (Jarrod Millman) Date: Fri, 1 Jun 2018 14:29:13 -0700 Subject: [Numpy-discussion] Python 3 compatible examples In-Reply-To: <1527885799.4058856.1393493968.7F81BC66@webmail.messagingengine.com> References: <1527885799.4058856.1393493968.7F81BC66@webmail.messagingengine.com> Message-ID: +1 On Fri, Jun 1, 2018 at 1:43 PM, Juan Nunez-Iglesias wrote: > > On Sat, Jun 2, 2018, at 6:22 AM, Pauli Virtanen wrote: >> For Scipy, we converted the examples in the documentation to Python 3, >> and have essentially ignored Python 2 compatibility. So far, I remember >> no complaints about it. > > I vote for what Pauli said. > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at python.org > https://mail.python.org/mailman/listinfo/numpy-discussion From millman at berkeley.edu Fri Jun 1 17:31:00 2018 From: millman at berkeley.edu (Jarrod Millman) Date: Fri, 1 Jun 2018 14:31:00 -0700 Subject: [Numpy-discussion] A roadmap for NumPy - longer term planning In-Reply-To: References: <69cf3275-26f3-deec-d499-b56204d96c60@gmail.com> <20180601165714.26ykmtlmi3p75iog@carbo> Message-ID: I like the idea of a random/controversial ideas section. On Fri, Jun 1, 2018 at 12:11 PM, Ralf Gommers wrote: > > > On Fri, Jun 1, 2018 at 9:57 AM, Stefan van der Walt > wrote: >> >> Hi Ralf, >> >> On Thu, 31 May 2018 21:57:06 -0700, Ralf Gommers wrote: >> > - "internal refactorings": MaskedArray yes, but the other ones no. >> > numpy.distutils and f2py are very hard to test, a big refactor pretty >> > much >> > guarantees breakage. there's also not much need for refactoring, because >> > those things are not coupled to the numpy.core internals. 
>> > numpy.financial >> > is simply uninteresting - we wish it wasn't there but it is, so now it >> > simply stays where it is. >> >> I want to clarify that in the current notes we put down ideas that >> prompted active discussion, even if they weren't necessarily feasible. >> I feel it is important to keep the conversation open to run its course >> until we have a good understanding of the various issues at hand. >> >> You may find that, in person, people are more willing to admit to their >> support for some "heretical" ideas than they are here on the list. > > > Thanks Stefan, good points. I totally agree that anything can be discussed. > >> >> >> E.g., you say that the financial functions "now simply stay", but that >> promises a future of a NumPy that never shrinks, while there is >> certainly some support for allowing NumPy to contract so that we can >> release maintenance burden and allow development of other core areas >> that have been neglected for a long time. >> >> You will *always* have small, vocal proponents of any specific piece of >> functionality; that doesn't necessarily mean that such functionality >> contributes to the health of a project as a whole. >> >> So, I gently urge us carefully reconsider the narrative that nothing can >> change/be removed, and evaluate each suggestion carefully, not weighing >> only the very evident negatives but also the longer term positives. > > > I don't think there's such a narrative - e.g. the removal of np.matrix that > we've planned and getting rid of MaskedArray at some point once we have a > better new masked array implementation are *major* removals. We do plan > those things because they have major benefits. Imho "major benefits" is a > bar that needs to be passed before listing features as up for removal on a > roadmap (even a draft one). > > It would be helpful maybe to find a form for the roadmap where the > essentials of such discussions (key pros/cons) can be captured. Or at least > split it in good/desirable/planned items and "wild ideas". > > Re `financial`, there isn't much of a pro as far as I can tell - there's > almost zero maintenance cost now, and it doesn't hinder any of the proposed > new features. Plus it's a discussion we've had a couple of times before. > > I know that the current roadmap doc is only draft, but it still says "NumPy > Roadmap" and it's the best thing we have now, so I'd prefer to not have > things there (or have them in a separate random/controversial ideas section) > that are unlikely to happen or for which it's unclear if they're good ideas. > > Cheers, > Ralf > > > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at python.org > https://mail.python.org/mailman/listinfo/numpy-discussion > From m.h.vankerkwijk at gmail.com Fri Jun 1 17:41:18 2018 From: m.h.vankerkwijk at gmail.com (Marten van Kerkwijk) Date: Fri, 1 Jun 2018 17:41:18 -0400 Subject: [Numpy-discussion] Allowing broadcasting of code dimensions in generalized ufuncs In-Reply-To: References: Message-ID: Hi Nathaniel, On Matt's prompting, I added release notes to the frozen/flexible PR [1]; see text attached below. Having done that, I felt the examples actually justified the frozen dimensions quite well. Given that you're the who expressed most doubts about them, could you have a look? Ideally, I'd avoid having to write a NEP for this, and the examples do seem to make it quite obvious that this change to the signature is the way to go, as its meaning is dead obvious. 
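For concreteness, the following is a plain-Python sketch (not part of the pull request) of what an elementary function with the fixed-size signature ``(),()->(3)`` computes; a real gufunc would get the broadcasting and the length-3 core dimension from the ufunc machinery itself.

.. code:: python

    import numpy as np

    # Plain-Python sketch of a "(),()->(3)" gufunc: each pair of scalar
    # angles (theta, phi) maps to a fixed-size core dimension of length 3
    # (a Cartesian unit vector), while the loop dimensions broadcast.
    def angles_to_unit_vector(theta, phi):
        theta, phi = np.broadcast_arrays(theta, phi)
        out = np.empty(theta.shape + (3,))
        out[..., 0] = np.sin(theta) * np.cos(phi)
        out[..., 1] = np.sin(theta) * np.sin(phi)
        out[..., 2] = np.cos(theta)
        return out

    theta = np.array([0.0, np.pi / 2])
    phi = np.array([0.0, np.pi / 2])
    print(angles_to_unit_vector(theta, phi))        # shape (2, 3)
    print(angles_to_unit_vector(0.3, phi).shape)    # broadcasting: (2, 3)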
And the implementation is super-straightforward... For the broadcasted core dimensions, I do agree the case is less strong and the meaning perhaps less obvious (implementation is relatively simple), and I think a short NEP may be called for (unless others on the list have super-convincing use cases...). I will add here, though, that even if we implement `all_equal` as a method on `equal`, it would still be useful to have a signature that can actually describe it. -- Marten [1] https://github.com/numpy/numpy/pull/11175/files Generalized ufunc signatures now allow fixed-size dimensions ------------------------------------------------------------ By using a numerical value in the signature of a generalized ufunc, one can indicate that the given function requires input or output to have dimensions with the given size. E.g., the signature of a function that converts a polar angle to a two-dimensional cartesian unit vector would be ``()->(2)``; that for one that converts two spherical angles to a three-dimensional unit vector would be ``(),()->(3)``; and that for the cross product of two three-dimensional vectors would be ``(3),(3)->(3)``. Note that to the elementary function these dimensions are not treated any differently from variable ones indicated with a letter; the loop still is passed the corresponding size, but it can now count on that being equal to the fixed size given in the signature. Generalized ufunc signatures now allow flexible dimensions ---------------------------------------------------------- Some functions, in particular numpy's implementation of ``@`` as ``matmul``, are very similar to generalized ufuncs in that they operate over core dimensions, but one could not present them as such because they were able to deal with inputs in which a dimension is missing. To support this, it is now allowed to postfix a dimension name with a question mark to indicate that that dimension does not necessarily have to be present. With this addition, the signature for ``matmul`` can be expressed as ``(m?,n),(n,p?)->(m?,p?)``. This indicates that if, e.g., the second operand has only one dimension, for the purposes of the elementary function it will be treated as if that input has core shape ``(n, 1)``, and the output has the corresponding core shape of ``(m, 1)``. The actual output array, however, has flexible dimension removed, i.e., it will have shape ``(..., n)``. Similarly, if both arguments have only a single dimension, the inputs will be presented as having shapes ``(1, n)`` and ``(n, 1)`` to the elementary function, and the output as ``(1, 1)``, while the actual output array returned will have shape ``()``. In this way, the signature thus allows one to use a single elementary function for four related but different signatures, ``(m,n),(n,p)->(m,p)``, ``(n),(n,p)->(p)``, ``(m,n),(n)->(m)`` and ``(n),(n)->()``. -------------- next part -------------- An HTML attachment was scrubbed... URL: From charlesr.harris at gmail.com Fri Jun 1 17:43:48 2018 From: charlesr.harris at gmail.com (Charles R Harris) Date: Fri, 1 Jun 2018 15:43:48 -0600 Subject: [Numpy-discussion] A roadmap for NumPy - longer term planning In-Reply-To: <69cf3275-26f3-deec-d499-b56204d96c60@gmail.com> References: <69cf3275-26f3-deec-d499-b56204d96c60@gmail.com> Message-ID: On Thu, May 31, 2018 at 5:50 PM, Matti Picus wrote: > At the recent NumPy sprint at BIDS (thanks to those who made the trip) we > spent some time brainstorming about a roadmap for NumPy, in the spirit of > similar work that was done for Jupyter. 
The idea is that a document with > wide community acceptance can guide the work of the full-time developer(s), > and be a source of ideas for expanding development efforts. > > I put the document up at https://github.com/numpy/numpy/wiki/NumPy-Roadmap, > and hope to discuss it at a BOF session during SciPy in the middle of July > in Austin. > > Eventually it could become a NEP or formalized in another way. > > Matti > Under maintenance we could add something about the transition to Python 3, in particular cleaning up the code and updating the documentation examples. Chuck -------------- next part -------------- An HTML attachment was scrubbed... URL: From shoyer at gmail.com Fri Jun 1 18:45:39 2018 From: shoyer at gmail.com (Stephan Hoyer) Date: Fri, 1 Jun 2018 15:45:39 -0700 Subject: [Numpy-discussion] Allowing broadcasting of code dimensions in generalized ufuncs In-Reply-To: References: Message-ID: On Fri, Jun 1, 2018 at 2:42 PM Marten van Kerkwijk < m.h.vankerkwijk at gmail.com> wrote: > Having done that, I felt the examples actually justified the frozen > dimensions quite well. Given that you're the who expressed most doubts > about them, could you have a look? Ideally, I'd avoid having to write a NEP > for this, and the examples do seem to make it quite obvious that this > change to the signature is the way to go, as its meaning is dead obvious. > And the implementation is super-straightforward... > I do think it would be valuable to have a brief NEP on this, especially on the solution for matmul. NEPs don't have to be long, and don't need to go into the full detail of implementations. But they are a nice place to summarize design discussions. In fact, I would say the text you have below is nearly enough for one or two NEPs. The parts that are missing would be valuable to add anyways: - A brief discussion (a sentence or two) of potential broader use-cases for optional dimensions (ufuncs that act on row/column vectors and matrices). - A brief discussion of rejected alternatives (only a few sentences for each alternative). -------------- next part -------------- An HTML attachment was scrubbed... URL: From m.h.vankerkwijk at gmail.com Fri Jun 1 19:38:41 2018 From: m.h.vankerkwijk at gmail.com (Marten van Kerkwijk) Date: Fri, 1 Jun 2018 19:38:41 -0400 Subject: [Numpy-discussion] Allowing broadcasting of code dimensions in generalized ufuncs In-Reply-To: References: Message-ID: For the flexible dimensions, that would be up to Nathaniel -- it's his idea ;-) And happily that means that I don't have to spend time looking up how this NEP business actually works, but can just copy & paste... -- Marten -------------- next part -------------- An HTML attachment was scrubbed... URL: From robert.kern at gmail.com Sat Jun 2 15:04:32 2018 From: robert.kern at gmail.com (Robert Kern) Date: Sat, 2 Jun 2018 12:04:32 -0700 Subject: [Numpy-discussion] NEP: Random Number Generator Policy Message-ID: As promised distressingly many months ago, I have written up a NEP about relaxing the stream-compatibility policy that we currently have. https://github.com/numpy/numpy/pull/11229 https://github.com/rkern/numpy/blob/nep/rng/doc/neps/nep-0019-rng-policy.rst I particularly invite comment on the two lists of methods that we still would make strict compatibility guarantees for. 
--- ============================== Random Number Generator Policy ============================== :Author: Robert Kern :Status: Draft :Type: Standards Track :Created: 2018-05-24 Abstract -------- For the past decade, NumPy has had a strict backwards compatibility policy for the number stream of all of its random number distributions. Unlike other numerical components in ``numpy``, which are usually allowed to return different when results when they are modified if they remain correct, we have obligated the random number distributions to always produce the exact same numbers in every version. The objective of our stream-compatibility guarantee was to provide exact reproducibility for simulations across numpy versions in order to promote reproducible research. However, this policy has made it very difficult to enhance any of the distributions with faster or more accurate algorithms. After a decade of experience and improvements in the surrounding ecosystem of scientific software, we believe that there are now better ways to achieve these objectives. We propose relaxing our strict stream-compatibility policy to remove the obstacles that are in the way of accepting contributions to our random number generation capabilities. The Status Quo -------------- Our current policy, in full: A fixed seed and a fixed series of calls to ``RandomState`` methods using the same parameters will always produce the same results up to roundoff error except when the values were incorrect. Incorrect values will be fixed and the NumPy version in which the fix was made will be noted in the relevant docstring. Extension of existing parameter ranges and the addition of new parameters is allowed as long the previous behavior remains unchanged. This policy was first instated in Nov 2008 (in essence; the full set of weasel words grew over time) in response to a user wanting to be sure that the simulations that formed the basis of their scientific publication could be reproduced years later, exactly, with whatever version of ``numpy`` that was current at the time. We were keen to support reproducible research, and it was still early in the life of ``numpy.random``. We had not seen much cause to change the distribution methods all that much. We also had not thought very thoroughly about the limits of what we really could promise (and by ?we? in this section, we really mean Robert Kern, let?s be honest). Despite all of the weasel words, our policy overpromises compatibility. The same version of ``numpy`` built on different platforms, or just in a different way could cause changes in the stream, with varying degrees of rarity. The biggest is that the ``.multivariate_normal()`` method relies on ``numpy.linalg`` functions. Even on the same platform, if one links ``numpy`` with a different LAPACK, ``.multivariate_normal()`` may well return completely different results. More rarely, building on a different OS or CPU can cause differences in the stream. We use C ``long`` integers internally for integer distribution (it seemed like a good idea at the time), and those can vary in size depending on the platform. Distribution methods can overflow their internal C ``longs`` at different breakpoints depending on the platform and cause all of the random variate draws that follow to be different. And even if all of that is controlled, our policy still does not provide exact guarantees across versions. We still do apply bug fixes when correctness is at stake. 
And even if we didn?t do that, any nontrivial program does more than just draw random numbers. They do computations on those numbers, transform those with numerical algorithms from the rest of ``numpy``, which is not subject to so strict a policy. Trying to maintain stream-compatibility for our random number distributions does not help reproducible research for these reasons. The standard practice now for bit-for-bit reproducible research is to pin all of the versions of code of your software stack, possibly down to the OS itself. The landscape for accomplishing this is much easier today than it was in 2008. We now have ``pip``. We now have virtual machines. Those who need to reproduce simulations exactly now can (and ought to) do so by using the exact same version of ``numpy``. We do not need to maintain stream-compatibility across ``numpy`` versions to help them. Our stream-compatibility guarantee has hindered our ability to make improvements to ``numpy.random``. Several first-time contributors have submitted PRs to improve the distributions, usually by implementing a faster, or more accurate algorithm than the one that is currently there. Unfortunately, most of them would have required breaking the stream to do so. Blocked by our policy, and our inability to work around that policy, many of those contributors simply walked away. Implementation -------------- We propose first freezing ``RandomState`` as it is and developing a new RNG subsystem alongside it. This allows anyone who has been relying on our old stream-compatibility guarantee to have plenty of time to migrate. ``RandomState`` will be considered deprecated, but with a long deprecation cycle, at least a few years. Deprecation warnings will start silent but become increasingly noisy over time. Bugs in the current state of the code will *not* be fixed if fixing them would impact the stream. However, if changes in the rest of ``numpy`` would break something in the ``RandomState`` code, we will fix ``RandomState`` to continue working (for example, some change in the C API). No new features will be added to ``RandomState``. Users should migrate to the new subsystem as they are able to. Work on a proposed `new PRNG subsystem `_ is already underway. The specifics of the new design are out of scope for this NEP and up for much discussion, but we will discuss general policies that will guide the evolution of whatever code is adopted. First, we will maintain API source compatibility just as we do with the rest of ``numpy``. If we *must* make a breaking change, we will only do so with an appropriate deprecation period and warnings. Second, breaking stream-compatibility in order to introduce new features or improve performance will be *allowed* with *caution*. Such changes will be considered features, and as such will be no faster than the standard release cadence of features (i.e. on ``X.Y`` releases, never ``X.Y.Z``). Slowness is not a bug. Correctness bug fixes that break stream-compatibility can happen on bugfix releases, per usual, but developers should consider if they can wait until the next feature release. We encourage developers to strongly weight user?s pain from the break in stream-compatibility against the improvements. One example of a worthwhile improvement would be to change algorithms for a significant increase in performance, for example, moving from the `Box-Muller transform `_ method of Gaussian variate generation to the faster `Ziggurat algorithm `_. 
An example of an unworthy improvement would be tweaking the Ziggurat tables just a little bit. Any new design for the RNG subsystem will provide a choice of different core uniform PRNG algorithms. We will be more strict about a select subset of methods on these core PRNG objects. They MUST guarantee stream-compatibility for a minimal, specified set of methods which are chosen to make it easier to compose them to build other distributions. Namely, * ``.bytes()`` * ``.random_uintegers()`` * ``.random_sample()`` Furthermore, the new design should also provide one generator class (we shall call it ``StableRandom`` for discussion purposes) that provides a slightly broader subset of distribution methods for which stream-compatibility is *guaranteed*. The point of ``StableRandom`` is to provide something that can be used in unit tests so projects that currently have tests which rely on the precise stream can be migrated off of ``RandomState``. For the best transition, ``StableRandom`` should use as its core uniform PRNG the current MT19937 algorithm. As best as possible, the API for the distribution methods that are provided on ``StableRandom`` should match their counterparts on ``RandomState``. They should provide the same stream that the current version of ``RandomState`` does. Because their intended use is for unit tests, we do not need the performance improvements from the new algorithms that will be introduced by the new subsystem. The list of ``StableRandom`` methods should be chosen to support unit tests: * ``.randint()`` * ``.uniform()`` * ``.normal()`` * ``.standard_normal()`` * ``.choice()`` * ``.shuffle()`` * ``.permutation()`` Not Versioning -------------- For a long time, we considered that the way to allow algorithmic improvements while maintaining the stream was to apply some form of versioning. That is, every time we make a stream change in one of the distributions, we increment some version number somewhere. ``numpy.random`` would keep all past versions of the code, and there would be a way to get the old versions. Proposals of how to do this exactly varied widely, but we will not exhaustively list them here. We spent years going back and forth on these designs and were not able to find one that sufficed. Let that time lost, and more importantly, the contributors that we lost while we dithered, serve as evidence against the notion. Concretely, adding in versioning makes maintenance of ``numpy.random`` difficult. Necessarily, we would be keeping lots of versions of the same code around. Adding a new algorithm safely would still be quite hard. But most importantly, versioning is fundamentally difficult to *use* correctly. We want to make it easy and straightforward to get the latest, fastest, best versions of the distribution algorithms; otherwise, what's the point? The way to make that easy is to make the latest the default. But the default will necessarily change from release to release, so the user?s code would need to be altered anyway to specify the specific version that one wants to replicate. Adding in versioning to maintain stream-compatibility would still only provide the same level of stream-compatibility that we currently do, with all of the limitations described earlier. Given that the standard practice for such needs is to pin the release of ``numpy`` as a whole, versioning ``RandomState`` alone is superfluous. 
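To make the intended role of the proposed ``StableRandom`` concrete, the sketch below shows the kind of seeded unit test it is meant to keep working.  ``StableRandom`` does not exist yet, so the example uses today's ``RandomState``; under this proposal only the constructor would change, while the MT19937-based stream, and therefore any captured reference values, would stay the same.

.. code:: python

    import numpy as np

    # Shape of the unit tests StableRandom is meant to serve: seed once,
    # draw from a handful of distributions, and compare against values
    # captured from a reference run.
    def reference_draws(seed):
        rs = np.random.RandomState(seed)    # under the NEP: StableRandom(seed)
        return {
            "normal": rs.normal(size=3),
            "randint": rs.randint(0, 100, size=3),
            "choice": rs.choice(["a", "b", "c"], size=3),
        }

    # In a real test suite, "captured" would be generated once and pasted
    # in as literals; draws from the same seed must keep matching it.
    captured = reference_draws(seed=2018)
    new = reference_draws(seed=2018)
    for key in captured:
        np.testing.assert_array_equal(captured[key], new[key])
    print("stream matches the captured reference")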
Discussion
----------

- https://mail.python.org/pipermail/numpy-discussion/2018-January/077608.html
- https://github.com/numpy/numpy/pull/10124#issuecomment-350876221

Copyright
---------

This document has been placed in the public domain.

--
Robert Kern
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From shoyer at gmail.com  Sat Jun  2 18:55:23 2018
From: shoyer at gmail.com (Stephan Hoyer)
Date: Sat, 2 Jun 2018 15:55:23 -0700
Subject: [Numpy-discussion] NEP: Dispatch Mechanism for NumPy's high level API
Message-ID:

Matthew Rocklin and I have written NEP-18, which proposes a new dispatch
mechanism for NumPy's high level API:
http://www.numpy.org/neps/nep-0018-array-function-protocol.html

There has already been a little bit of scattered discussion on the pull
request (https://github.com/numpy/numpy/pull/11189), but per NEP-0 let's
try to keep high-level discussion here on the mailing list.

The full text of the NEP is reproduced below:

==================================================
NEP: Dispatch Mechanism for NumPy's high level API
==================================================

:Author: Stephan Hoyer
:Author: Matthew Rocklin
:Status: Draft
:Type: Standards Track
:Created: 2018-05-29

Abstract
--------

We propose a protocol to allow arguments of numpy functions to define
how that function operates on them.  This allows other libraries that
implement NumPy's high level API to reuse Numpy functions.  This allows
libraries that extend NumPy's high level API to apply to more
NumPy-like libraries.

Detailed description
--------------------

Numpy's high level ndarray API has been implemented several times
outside of NumPy itself for different architectures, such as for GPU
arrays (CuPy), Sparse arrays (scipy.sparse, pydata/sparse) and parallel
arrays (Dask array) as well as various Numpy-like implementations in the
deep learning frameworks, like TensorFlow and PyTorch.

Similarly there are several projects that build on top of the Numpy API
for labeled and indexed arrays (XArray), automatic differentiation
(Autograd, Tangent), higher order array factorizations (TensorLy), etc.
that add additional functionality on top of the Numpy API.

We would like to be able to use these libraries together, for example we
would like to be able to place a CuPy array within XArray, or perform
automatic differentiation on Dask array code.  This would be easier to
accomplish if code written for NumPy ndarrays could also be used by
other NumPy-like projects.

For example, we would like for the following code example to work
equally well with any Numpy-like array object:

.. code:: python

    def f(x):
        y = np.tensordot(x, x.T)
        return np.mean(np.exp(y))

Some of this is possible today with various protocol mechanisms within
Numpy.

- The ``np.exp`` function checks the ``__array_ufunc__`` protocol
- The ``.T`` method works using Python's method dispatch
- The ``np.mean`` function explicitly checks for a ``.mean`` method on
  the argument

However other functions, like ``np.tensordot``, do not dispatch, and
instead are likely to coerce to a Numpy array (using the ``__array__``
protocol) or err outright.  To achieve enough coverage of the NumPy API
to support downstream projects like XArray and autograd we want to
support *almost all* functions within Numpy, which calls for a more
far-reaching protocol than just ``__array_ufunc__``.
We would like a protocol that allows arguments of a NumPy function to take
control and divert execution to another function (for example a GPU or
parallel implementation) in a way that is safe and consistent across
projects.

Implementation
--------------

We propose adding support for a new protocol in NumPy,
``__array_function__``.

This protocol is intended to be a catch-all for NumPy functionality that
is not covered by existing protocols, like reductions (like ``np.sum``)
or universal functions (like ``np.exp``).  The semantics are very similar
to ``__array_ufunc__``, except the operation is specified by an arbitrary
callable object rather than a ufunc instance and method.

The interface
~~~~~~~~~~~~~

We propose the following signature for implementations of
``__array_function__``:

.. code-block:: python

    def __array_function__(self, func, types, args, kwargs)

- ``func`` is an arbitrary callable exposed by NumPy's public API, which
  was called in the form ``func(*args, **kwargs)``.
- ``types`` is a list of types for all arguments to the original NumPy
  function call that will be checked for an ``__array_function__``
  implementation.
- The tuple ``args`` and dict ``**kwargs`` are directly passed on from the
  original call.

Unlike ``__array_ufunc__``, there are no high-level guarantees about the
type of ``func``, or about which of ``args`` and ``kwargs`` may contain
objects implementing the array API.

As a convenience for ``__array_function__`` implementors of the NumPy
API, the ``types`` keyword contains a list of all types that implement
the ``__array_function__`` protocol.  This allows downstream
implementations to quickly determine if they are likely able to support
the operation.

Still to be determined: what guarantees can we offer for ``types``?
Should we promise that types are unique, and appear in the order in which
they are checked?

Example for a project implementing the NumPy API
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Most implementations of ``__array_function__`` will start with two
checks:

1. Is the given function something that we know how to overload?
2. Are all arguments of a type that we know how to handle?

If these conditions hold, ``__array_function__`` should return the result
from calling its implementation for ``func(*args, **kwargs)``.  Otherwise,
it should return the sentinel value ``NotImplemented``, indicating that
the function is not implemented by these types.

.. code:: python

    class MyArray:
        def __array_function__(self, func, types, args, kwargs):
            if func not in HANDLED_FUNCTIONS:
                return NotImplemented
            if not all(issubclass(t, MyArray) for t in types):
                return NotImplemented
            return HANDLED_FUNCTIONS[func](*args, **kwargs)

    HANDLED_FUNCTIONS = {
        np.concatenate: my_concatenate,
        np.broadcast_to: my_broadcast_to,
        np.sum: my_sum,
        ...
    }

Necessary changes within the Numpy codebase itself
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

This will require two changes within the Numpy codebase:

1. A function to inspect available inputs, look for the
   ``__array_function__`` attribute on those inputs, and call those
   methods appropriately until one succeeds.  This needs to be fast in the
   common all-NumPy case.  This is one additional function of moderate
   complexity.
2. Calling this function within all relevant Numpy functions.  This
   affects many parts of the Numpy codebase, although with very low
   complexity.
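As a rough illustration of point 1 above, the following is a simplified, pure-Python sketch of what such a helper might look like.  It is not NumPy's implementation: it ignores the subclass-before-superclass reordering described below and does not look inside nested lists.

.. code:: python

    def do_array_function_dance(func, relevant_arguments, args, kwargs):
        # Collect overloaded arguments left to right, one per distinct type.
        overloaded_args = []
        overloaded_types = []
        for arg in relevant_arguments:
            if hasattr(arg, '__array_function__'):
                if type(arg) not in overloaded_types:
                    overloaded_types.append(type(arg))
                    overloaded_args.append(arg)

        if not overloaded_args:
            return False, None  # fast path: plain NumPy inputs, no overloads

        # Try each implementation until one accepts the operation.
        for arg in overloaded_args:
            result = arg.__array_function__(func, overloaded_types, args, kwargs)
            if result is not NotImplemented:
                return True, result

        raise TypeError('no implementation of {!r} found for these types'
                        .format(getattr(func, '__name__', func)))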
Finding and calling the right ``__array_function__`` ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Given a Numpy function, ``*args`` and ``**kwargs`` inputs, we need to search through ``*args`` and ``**kwargs`` for all appropriate inputs that might have the ``__array_function__`` attribute. Then we need to select among those possible methods and execute the right one. Negotiating between several possible implementations can be complex. Finding arguments ''''''''''''''''' Valid arguments may be directly in the ``*args`` and ``**kwargs``, such as in the case for ``np.tensordot(left, right, out=out)``, or they may be nested within lists or dictionaries, such as in the case of ``np.concatenate([x, y, z])``. This can be problematic for two reasons: 1. Some functions are given long lists of values, and traversing them might be prohibitively expensive 2. Some function may have arguments that we don't want to inspect, even if they have the ``__array_function__`` method To resolve these we ask the functions to provide an explicit list of arguments that should be traversed. This is the ``relevant_arguments=`` keyword in the examples below. Trying ``__array_function__`` methods until the right one works ''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''' Many arguments may implement the ``__array_function__`` protocol. Some of these may decide that, given the available inputs, they are unable to determine the correct result. How do we call the right one? If several are valid then which has precedence? The rules for dispatch with ``__array_function__`` match those for ``__array_ufunc__`` (see `NEP-13 `_). In particular: - NumPy will gather implementations of ``__array_function__`` from all specified inputs and call them in order: subclasses before superclasses, and otherwise left to right. Note that in some edge cases, this differs slightly from the `current behavior `_ of Python. - Implementations of ``__array_function__`` indicate that they can handle the operation by returning any value other than ``NotImplemented``. - If all ``__array_function__`` methods return ``NotImplemented``, NumPy will raise ``TypeError``. Changes within Numpy functions ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Given a function defined above, for now call it ``do_array_function_dance``, we now need to call that function from within every relevant Numpy function. This is a pervasive change, but of fairly simple and innocuous code that should complete quickly and without effect if no arguments implement the ``__array_function__`` protocol. Let us consider a few examples of NumPy functions and how they might be affected by this change: .. code:: python def broadcast_to(array, shape, subok=False): success, value = do_array_function_dance( func=broadcast_to, relevant_arguments=[array], args=(array,), kwargs=dict(shape=shape, subok=subok)) if success: return value ... # continue with the definition of broadcast_to def concatenate(arrays, axis=0, out=None) success, value = do_array_function_dance( func=concatenate, relevant_arguments=[arrays, out], args=(arrays,), kwargs=dict(axis=axis, out=out)) if success: return value ... # continue with the definition of concatenate The list of objects passed to ``relevant_arguments`` are those that should be inspected for ``__array_function__`` implementations. Alternatively, we could write these overloads with a decorator, e.g., .. code:: python @overload_for_array_function(['array']) def broadcast_to(array, shape, subok=False): ... 
        # continue with the definition of broadcast_to

    @overload_for_array_function(['arrays', 'out'])
    def concatenate(arrays, axis=0, out=None):
        ...  # continue with the definition of concatenate

The decorator ``overload_for_array_function`` would be written in terms of
``do_array_function_dance`` (a rough sketch of such a decorator appears
below, just before the Alternatives section).

The downside of this approach would be a loss of introspection capability
for NumPy functions on Python 2, since this requires the use of
``inspect.Signature`` (only available on Python 3). However, NumPy won't
be supporting Python 2 for `very much longer
<http://www.numpy.org/neps/nep-0014-dropping-python2.7-proposal.html>`_.

Use outside of NumPy
~~~~~~~~~~~~~~~~~~~~

Nothing about this protocol is particular to NumPy itself. Should we
encourage use of the same ``__array_function__`` protocol in third-party
libraries for overloading non-NumPy functions, e.g., for making
array-implementation generic functionality in SciPy?

This would offer significant advantages (SciPy wouldn't need to invent its
own dispatch system) and no downsides that we can think of, because every
function that dispatches with ``__array_function__`` already needs to be
explicitly recognized. Libraries like Dask, CuPy, and Autograd already
wrap a limited subset of SciPy functionality (e.g., ``scipy.linalg``)
similarly to how they wrap NumPy.

If we want to do this, we should consider exposing the helper function
``do_array_function_dance()`` above as a public API.

Non-goals
---------

We are aiming for a basic strategy that can be relatively mechanistically
applied to almost all functions in NumPy's API in a relatively short
period of time, the development cycle of a single NumPy release.

We hope to get both the ``__array_function__`` protocol and all specific
overloads right on the first try, but our explicit aim here is to get
something that mostly works (and can be iterated upon), rather than to
wait for an optimal implementation. The price of moving fast is that for
now **this protocol should be considered strictly experimental**. We
reserve the right to change the details of this protocol and how specific
NumPy functions use it at any time in the future -- even in otherwise
bug-fix only releases of NumPy.

In particular, we don't plan to write additional NEPs that list all
specific functions to overload, with exactly how they should be
overloaded. We will leave this up to the discretion of committers on
individual pull requests, trusting that they will surface any
controversies for discussion by interested parties.

However, we already know several families of functions that should be
explicitly excluded from ``__array_function__``. These will need their own
protocols:

- universal functions, which already have their own protocol.
- ``array`` and ``asarray``, because they are explicitly intended for
  coercion to actual ``numpy.ndarray`` objects.
- dispatch for methods of any kind, e.g., methods on
  ``np.random.RandomState`` objects.

As a concrete example of how we expect to break behavior in the future,
some functions such as ``np.where`` are currently not NumPy universal
functions, but conceivably could become universal functions in the future.
When/if this happens, we will change such overloads from using
``__array_function__`` to the more specialized ``__array_ufunc__``.

Backward compatibility
----------------------

This proposal does not change existing semantics, except for those
arguments that currently have ``__array_function__`` methods, which should
be rare.
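Returning to the decorator idea from "Changes within Numpy functions"
above, a minimal sketch of how ``overload_for_array_function`` might be
written is shown below. It is only a sketch: it relies on the hypothetical
``do_array_function_dance`` helper from the earlier examples, uses
``inspect.signature`` (hence Python 3 only), and none of its details are
part of the proposal.

.. code:: python

    import functools
    import inspect

    def overload_for_array_function(relevant_arg_names):
        """Hypothetical decorator: try __array_function__ dispatch first."""
        def decorator(implementation):
            signature = inspect.signature(implementation)

            @functools.wraps(implementation)
            def public_api(*args, **kwargs):
                bound = signature.bind(*args, **kwargs)
                bound.apply_defaults()
                relevant_args = [bound.arguments[name]
                                 for name in relevant_arg_names]
                # do_array_function_dance is the (assumed) helper
                # described earlier in this NEP.
                success, value = do_array_function_dance(
                    func=public_api, relevant_arguments=relevant_args,
                    args=args, kwargs=kwargs)
                if success:
                    return value
                return implementation(*args, **kwargs)
            return public_api
        return decorator

With such a decorator, the explicit boilerplate in the ``broadcast_to``
and ``concatenate`` examples above would reduce to a single line above
each existing definition.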
Alternatives
------------

Specialized protocols
~~~~~~~~~~~~~~~~~~~~~

We could (and should) continue to develop protocols like
``__array_ufunc__`` for cohesive subsets of Numpy functionality.

As mentioned above, if this means that some functions that we overload
with ``__array_function__`` should switch to a new protocol instead, that
is explicitly OK for as long as ``__array_function__`` retains its
experimental status.

Separate namespace
~~~~~~~~~~~~~~~~~~

A separate namespace for overloaded functions is another possibility,
either inside or outside of NumPy.

This has the advantage of alleviating any possible concerns about
backwards compatibility and would provide the maximum freedom for quick
experimentation. In the long term, it would provide a clean abstraction
layer, separating NumPy's high level API from default implementations on
``numpy.ndarray`` objects.

The downsides are that this would require an explicit opt-in from all
existing code, e.g., ``import numpy.api as np``, and in the long term
would result in the maintenance of two separate NumPy APIs. Also, many
functions from ``numpy`` itself are already overloaded (but inadequately),
so confusion about high vs. low level APIs in NumPy would still persist.

Multiple dispatch
~~~~~~~~~~~~~~~~~

An alternative to our suggestion of the ``__array_function__`` protocol
would be implementing NumPy's core functions as `multi-methods `_.
Although one of us wrote a `multiple dispatch library `_ for Python, we
don't think this approach makes sense for NumPy in the near term.

The main reason is that NumPy already has a well-proven dispatching
mechanism with ``__array_ufunc__``, based on Python's own dispatching
system for arithmetic, and it would be confusing to add another mechanism
that works in a very different way. This would also be a more invasive
change to NumPy itself, which would need to gain a multiple dispatch
implementation.

It is possible that a multiple dispatch implementation for NumPy's high
level API could make sense in the future. Fortunately,
``__array_function__`` does not preclude this possibility, because it
would be straightforward to write a shim for a default
``__array_function__`` implementation in terms of multiple dispatch.

Implementations in terms of a limited core API
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The internal implementations of some NumPy functions are extremely simple.
For example:

- ``np.stack()`` is implemented in only a few lines of code by combining
  indexing with ``np.newaxis``, ``np.concatenate`` and the ``shape``
  attribute.
- ``np.mean()`` is implemented internally in terms of ``np.sum()``,
  ``np.divide()``, ``.astype()`` and ``.shape``.

This suggests the possibility of defining a minimal "core" ndarray
interface, and relying upon it internally in NumPy to implement the full
API. This is an attractive option, because it could significantly reduce
the work required for new array implementations.

However, this also comes with several downsides:

1. The details of how NumPy implements a high-level function in terms of
   overloaded functions now becomes an implicit part of NumPy's public
   API. For example, refactoring ``stack`` to use ``np.block()`` instead
   of ``np.concatenate()`` internally would now become a breaking change.
2. Array libraries may prefer to implement high level functions
   differently than NumPy. For example, a library might prefer to
   implement a fundamental operation like ``mean()`` directly rather than
   relying on ``sum()`` followed by division.
   More generally, it's not clear yet what exactly qualifies as core
   functionality, and figuring this out could be a large project.
3. We don't yet have an overloading system for attributes and methods on
   array objects, e.g., for accessing ``.dtype`` and ``.shape``. This
   should be the subject of a future NEP, but until then we should be
   reluctant to rely on these properties.

Given these concerns, we encourage relying on this approach only in
limited cases.

Coercion to a NumPy array as a catch-all fallback
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

With the current design, classes that implement ``__array_function__`` to
overload at least one function implicitly declare an intent to implement
the entire NumPy API. It's not possible to implement *only*
``np.concatenate()`` on a type while falling back to NumPy's default
behavior of casting with ``np.asarray()`` for all other functions.

This could present a backwards compatibility concern that would discourage
libraries from adopting ``__array_function__`` in an incremental fashion.
For example, currently most numpy functions will implicitly convert
``pandas.Series`` objects into NumPy arrays, behavior that assuredly many
pandas users rely on. If pandas implemented ``__array_function__`` only
for ``np.concatenate``, unrelated NumPy functions like ``np.nanmean``
would suddenly break on pandas objects by raising TypeError.

With ``__array_ufunc__``, it's possible to alleviate this concern by
casting all arguments to numpy arrays and re-calling the ufunc, but the
heterogeneous function signatures supported by ``__array_function__`` make
it impossible to implement this generic fallback behavior for
``__array_function__``.

We could resolve this issue by changing the handling of return values in
``__array_function__`` in either of two possible ways:

1. Change the meaning of all arguments returning ``NotImplemented`` to
   indicate that all arguments should be coerced to NumPy arrays instead.
   However, many array libraries (e.g., scipy.sparse) really don't want
   implicit conversions to NumPy arrays, and often avoid implementing
   ``__array__`` for exactly this reason. Implicit conversions can result
   in silent bugs and performance degradation.
2. Use another sentinel value of some sort to indicate that a class
   implementing part of the higher level array API is coercible as a
   fallback, e.g., a return value of ``np.NotImplementedButCoercible``
   from ``__array_function__``.

If we take this second approach, we would need to define additional rules
for how coercible array arguments are coerced, e.g.,

- Would we try for ``__array_function__`` overloads again after coercing
  coercible arguments?
- If so, would we coerce coercible arguments one-at-a-time, or
  all-at-once?

These are slightly tricky design questions, so for now we propose to defer
this issue. We can always implement ``np.NotImplementedButCoercible`` at
some later time if it proves critical to the numpy community in the
future. Importantly, we don't think this will stop critical libraries that
desire to implement most of the high level NumPy API from adopting this
proposal.

NOTE: If you are reading this NEP in its draft state and disagree, please
speak up on the mailing list!
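To make the second option above concrete, a hypothetical pandas-like
container might then write something like the sketch below. Note that
``np.NotImplementedButCoercible`` does not exist today, and
``series_concatenate`` is a placeholder helper invented purely for
illustration.

.. code:: python

    import numpy as np

    class Series:
        """Hypothetical pandas-like container adopting option 2 above."""

        def __array_function__(self, func, types, args, kwargs):
            if func is np.concatenate:
                # series_concatenate stands in for a real implementation.
                return series_concatenate(*args, **kwargs)
            # For every other function, ask NumPy to fall back to coercing
            # this object with np.asarray(). The sentinel below is the one
            # proposed (but not adopted) above; it is not NumPy API.
            return np.NotImplementedButCoercible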
Drawbacks of this approach
--------------------------

Future difficulty extending NumPy's API
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

One downside of passing all arguments directly on to
``__array_function__`` is that it makes it hard to extend the signatures
of overloaded NumPy functions with new arguments, because adding even an
optional keyword argument would break existing overloads.

This is not a new problem for NumPy. NumPy has occasionally changed the
signature for functions in the past, including functions like
``numpy.sum`` which support overloads.

For adding new keyword arguments that do not change default behavior, we
would only include these as keyword arguments when they have changed from
default values. This is similar to `what NumPy already has done
<https://github.com/numpy/numpy/blob/v1.14.2/numpy/core/fromnumeric.py#L1865-L1867>`_,
e.g., for the optional ``keepdims`` argument in ``sum``:

.. code:: python

    def sum(array, ..., keepdims=np._NoValue):
        kwargs = {}
        if keepdims is not np._NoValue:
            kwargs['keepdims'] = keepdims
        return array.sum(..., **kwargs)

In other cases, such as deprecated arguments, preserving the existing
behavior of overloaded functions may not be possible. Libraries that use
``__array_function__`` should be aware of this risk: we don't propose to
freeze NumPy's API in stone any more than it already is.

Difficulty adding implementation specific arguments
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Some array implementations generally follow NumPy's API, but have
additional optional keyword arguments (e.g., ``dask.array.sum()`` has
``split_every`` and ``tensorflow.reduce_sum()`` has ``name``). A generic
dispatching library could potentially pass on all unrecognized keyword
arguments directly to the implementation, but extending ``np.sum()`` to
pass on ``**kwargs`` would entail public facing changes in NumPy.
Customizing the detailed behavior of array libraries will require using
library specific functions, which could be limiting in the case of
libraries that consume the NumPy API such as xarray.

Discussion
----------

Various alternatives to this proposal were discussed in a few Github
issues:

1. `pydata/sparse #1 `_
2. `numpy/numpy #11129 `_

Additionally it was the subject of `a blogpost `_. Following this it was
discussed at a `NumPy developer sprint `_ at the `UC Berkeley Institute
for Data Science (BIDS) `_.

References and Footnotes
------------------------

.. [1] Each NEP must either be explicitly labeled as placed in the public
   domain (see this NEP as an example) or licensed under the `Open
   Publication License`_.

.. _Open Publication License: http://www.opencontent.org/openpub/

Copyright
---------

This document has been placed in the public domain. [1]_

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From nathan12343 at gmail.com Sat Jun 2 19:58:23 2018
From: nathan12343 at gmail.com (Nathan Goldbaum)
Date: Sat, 2 Jun 2018 18:58:23 -0500
Subject: [Numpy-discussion] =?utf-8?q?NEP=3A_Dispatch_Mechanism_for_NumPy?=
 =?utf-8?q?=E2=80=99s_high_level_API?=
In-Reply-To: 
References: 
Message-ID: 

Perhaps I missed this but I didn't see: what happens when both
__array_ufunc__ and __array_function__ are defined? I might want to do
this to, for example, add support for functions like concatenate or stack
to a class that already has an __array_ufunc__ defined.
On Sat, Jun 2, 2018 at 5:56 PM Stephan Hoyer wrote: > Matthew Rocklin and I have written NEP-18, which proposes a new dispatch > mechanism for NumPy's high level API: > http://www.numpy.org/neps/nep-0018-array-function-protocol.html > > There has already been a little bit of scattered discussion on the pull > request (https://github.com/numpy/numpy/pull/11189), but per NEP-0 let's > try to keep high-level discussion here on the mailing list. > > The full text of the NEP is reproduced below: > > ================================================== > NEP: Dispatch Mechanism for NumPy's high level API > ================================================== > > :Author: Stephan Hoyer > :Author: Matthew Rocklin > :Status: Draft > :Type: Standards Track > :Created: 2018-05-29 > > Abstact > ------- > > We propose a protocol to allow arguments of numpy functions to define > how that function operates on them. This allows other libraries that > implement NumPy's high level API to reuse Numpy functions. This allows > libraries that extend NumPy's high level API to apply to more NumPy-like > libraries. > > Detailed description > -------------------- > > Numpy's high level ndarray API has been implemented several times > outside of NumPy itself for different architectures, such as for GPU > arrays (CuPy), Sparse arrays (scipy.sparse, pydata/sparse) and parallel > arrays (Dask array) as well as various Numpy-like implementations in the > deep learning frameworks, like TensorFlow and PyTorch. > > Similarly there are several projects that build on top of the Numpy API > for labeled and indexed arrays (XArray), automatic differentation > (Autograd, Tangent), higher order array factorizations (TensorLy), etc. > that add additional functionality on top of the Numpy API. > > We would like to be able to use these libraries together, for example we > would like to be able to place a CuPy array within XArray, or perform > automatic differentiation on Dask array code. This would be easier to > accomplish if code written for NumPy ndarrays could also be used by > other NumPy-like projects. > > For example, we would like for the following code example to work > equally well with any Numpy-like array object: > > .. code:: python > > def f(x): > y = np.tensordot(x, x.T) > return np.mean(np.exp(y)) > > Some of this is possible today with various protocol mechanisms within > Numpy. > > - The ``np.exp`` function checks the ``__array_ufunc__`` protocol > - The ``.T`` method works using Python's method dispatch > - The ``np.mean`` function explicitly checks for a ``.mean`` method on > the argument > > However other functions, like ``np.tensordot`` do not dispatch, and > instead are likely to coerce to a Numpy array (using the ``__array__``) > protocol, or err outright. To achieve enough coverage of the NumPy API > to support downstream projects like XArray and autograd we want to > support *almost all* functions within Numpy, which calls for a more > reaching protocol than just ``__array_ufunc__``. We would like a > protocol that allows arguments of a NumPy function to take control and > divert execution to another function (for example a GPU or parallel > implementation) in a way that is safe and consistent across projects. > > Implementation > -------------- > > We propose adding support for a new protocol in NumPy, > ``__array_function__``. 
> > This protocol is intended to be a catch-all for NumPy functionality that > is not covered by existing protocols, like reductions (like ``np.sum``) > or universal functions (like ``np.exp``). The semantics are very similar > to ``__array_ufunc__``, except the operation is specified by an > arbitrary callable object rather than a ufunc instance and method. > > The interface > ~~~~~~~~~~~~~ > > We propose the following signature for implementations of > ``__array_function__``: > > .. code-block:: python > > def __array_function__(self, func, types, args, kwargs) > > - ``func`` is an arbitrary callable exposed by NumPy's public API, > which was called in the form ``func(*args, **kwargs)``. > - ``types`` is a list of types for all arguments to the original NumPy > function call that will be checked for an ``__array_function__`` > implementation. > - The tuple ``args`` and dict ``**kwargs`` are directly passed on from the > original call. > > Unlike ``__array_ufunc__``, there are no high-level guarantees about the > type of ``func``, or about which of ``args`` and ``kwargs`` may contain > objects > implementing the array API. As a convenience for ``__array_function__`` > implementors of the NumPy API, the ``types`` keyword contains a list of all > types that implement the ``__array_function__`` protocol. This allows > downstream implementations to quickly determine if they are likely able to > support the operation. > > Still be determined: what guarantees can we offer for ``types``? Should > we promise that types are unique, and appear in the order in which they > are checked? > > Example for a project implementing the NumPy API > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ > > Most implementations of ``__array_function__`` will start with two > checks: > > 1. Is the given function something that we know how to overload? > 2. Are all arguments of a type that we know how to handle? > > If these conditions hold, ``__array_function__`` should return > the result from calling its implementation for ``func(*args, **kwargs)``. > Otherwise, it should return the sentinel value ``NotImplemented``, > indicating > that the function is not implemented by these types. > > .. code:: python > > class MyArray: > def __array_function__(self, func, types, args, kwargs): > if func not in HANDLED_FUNCTIONS: > return NotImplemented > if not all(issubclass(t, MyArray) for t in types): > return NotImplemented > return HANDLED_FUNCTIONS[func](*args, **kwargs) > > HANDLED_FUNCTIONS = { > np.concatenate: my_concatenate, > np.broadcast_to: my_broadcast_to, > np.sum: my_sum, > ... > } > > Necessary changes within the Numpy codebase itself > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ > > This will require two changes within the Numpy codebase: > > 1. A function to inspect available inputs, look for the > ``__array_function__`` attribute on those inputs, and call those > methods appropriately until one succeeds. This needs to be fast in the > common all-NumPy case. > > This is one additional function of moderate complexity. > 2. Calling this function within all relevant Numpy functions. > > This affects many parts of the Numpy codebase, although with very low > complexity. > > Finding and calling the right ``__array_function__`` > ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ > > Given a Numpy function, ``*args`` and ``**kwargs`` inputs, we need to > search through ``*args`` and ``**kwargs`` for all appropriate inputs > that might have the ``__array_function__`` attribute. 
Then we need to > select among those possible methods and execute the right one. > Negotiating between several possible implementations can be complex. > > Finding arguments > ''''''''''''''''' > > Valid arguments may be directly in the ``*args`` and ``**kwargs``, such > as in the case for ``np.tensordot(left, right, out=out)``, or they may > be nested within lists or dictionaries, such as in the case of > ``np.concatenate([x, y, z])``. This can be problematic for two reasons: > > 1. Some functions are given long lists of values, and traversing them > might be prohibitively expensive > 2. Some function may have arguments that we don't want to inspect, even > if they have the ``__array_function__`` method > > To resolve these we ask the functions to provide an explicit list of > arguments that should be traversed. This is the ``relevant_arguments=`` > keyword in the examples below. > > Trying ``__array_function__`` methods until the right one works > ''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''' > > Many arguments may implement the ``__array_function__`` protocol. Some > of these may decide that, given the available inputs, they are unable to > determine the correct result. How do we call the right one? If several > are valid then which has precedence? > > The rules for dispatch with ``__array_function__`` match those for > ``__array_ufunc__`` (see > `NEP-13 `_). > In particular: > > - NumPy will gather implementations of ``__array_function__`` from all > specified inputs and call them in order: subclasses before > superclasses, and otherwise left to right. Note that in some edge cases, > this differs slightly from the > `current behavior `_ of Python. > - Implementations of ``__array_function__`` indicate that they can > handle the operation by returning any value other than > ``NotImplemented``. > - If all ``__array_function__`` methods return ``NotImplemented``, > NumPy will raise ``TypeError``. > > Changes within Numpy functions > ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ > > Given a function defined above, for now call it > ``do_array_function_dance``, we now need to call that function from > within every relevant Numpy function. This is a pervasive change, but of > fairly simple and innocuous code that should complete quickly and > without effect if no arguments implement the ``__array_function__`` > protocol. Let us consider a few examples of NumPy functions and how they > might be affected by this change: > > .. code:: python > > def broadcast_to(array, shape, subok=False): > success, value = do_array_function_dance( > func=broadcast_to, > relevant_arguments=[array], > args=(array,), > kwargs=dict(shape=shape, subok=subok)) > if success: > return value > > ... # continue with the definition of broadcast_to > > def concatenate(arrays, axis=0, out=None) > success, value = do_array_function_dance( > func=concatenate, > relevant_arguments=[arrays, out], > args=(arrays,), > kwargs=dict(axis=axis, out=out)) > if success: > return value > > ... # continue with the definition of concatenate > > The list of objects passed to ``relevant_arguments`` are those that should > be inspected for ``__array_function__`` implementations. > > Alternatively, we could write these overloads with a decorator, e.g., > > .. code:: python > > @overload_for_array_function(['array']) > def broadcast_to(array, shape, subok=False): > ... # continue with the definition of broadcast_to > > @overload_for_array_function(['arrays', 'out']) > def concatenate(arrays, axis=0, out=None): > ... 
# continue with the definition of concatenate > > The decorator ``overload_for_array_function`` would be written in terms > of ``do_array_function_dance``. > > The downside of this approach would be a loss of introspection capability > for NumPy functions on Python 2, since this requires the use of > ``inspect.Signature`` (only available on Python 3). However, NumPy won't > be supporting Python 2 for `very much longer < > http://www.numpy.org/neps/nep-0014-dropping-python2.7-proposal.html>`_. > > Use outside of NumPy > ~~~~~~~~~~~~~~~~~~~~ > > Nothing about this protocol that is particular to NumPy itself. Should > we enourage use of the same ``__array_function__`` protocol third-party > libraries for overloading non-NumPy functions, e.g., for making > array-implementation generic functionality in SciPy? > > This would offer significant advantages (SciPy wouldn't need to invent > its own dispatch system) and no downsides that we can think of, because > every function that dispatches with ``__array_function__`` already needs > to be explicitly recognized. Libraries like Dask, CuPy, and Autograd > already wrap a limited subset of SciPy functionality (e.g., > ``scipy.linalg``) similarly to how they wrap NumPy. > > If we want to do this, we should consider exposing the helper function > ``do_array_function_dance()`` above as a public API. > > Non-goals > --------- > > We are aiming for basic strategy that can be relatively mechanistically > applied to almost all functions in NumPy's API in a relatively short > period of time, the development cycle of a single NumPy release. > > We hope to get both the ``__array_function__`` protocol and all specific > overloads right on the first try, but our explicit aim here is to get > something that mostly works (and can be iterated upon), rather than to > wait for an optimal implementation. The price of moving fast is that for > now **this protocol should be considered strictly experimental**. We > reserve the right to change the details of this protocol and how > specific NumPy functions use it at any time in the future -- even in > otherwise bug-fix only releases of NumPy. > > In particular, we don't plan to write additional NEPs that list all > specific functions to overload, with exactly how they should be > overloaded. We will leave this up to the discretion of committers on > individual pull requests, trusting that they will surface any > controversies for discussion by interested parties. > > However, we already know several families of functions that should be > explicitly exclude from ``__array_function__``. These will need their > own protocols: > > - universal functions, which already have their own protocol. > - ``array`` and ``asarray``, because they are explicitly intended for > coercion to actual ``numpy.ndarray`` object. > - dispatch for methods of any kind, e.g., methods on > ``np.random.RandomState`` objects. > > As a concrete example of how we expect to break behavior in the future, > some functions such as ``np.where`` are currently not NumPy universal > functions, but conceivably could become universal functions in the > future. When/if this happens, we will change such overloads from using > ``__array_function__`` to the more specialized ``__array_ufunc__``. > > > Backward compatibility > ---------------------- > > This proposal does not change existing semantics, except for those > arguments > that currently have ``__array_function__`` methods, which should be rare. 
> > > Alternatives > ------------ > > Specialized protocols > ~~~~~~~~~~~~~~~~~~~~~ > > We could (and should) continue to develop protocols like > ``__array_ufunc__`` for cohesive subsets of Numpy functionality. > > As mentioned above, if this means that some functions that we overload > with ``__array_function__`` should switch to a new protocol instead, > that is explicitly OK for as long as ``__array_function__`` retains its > experimental status. > > Separate namespace > ~~~~~~~~~~~~~~~~~~ > > A separate namespace for overloaded functions is another possibility, > either inside or outside of NumPy. > > This has the advantage of alleviating any possible concerns about > backwards compatibility and would provide the maximum freedom for quick > experimentation. In the long term, it would provide a clean abstration > layer, separating NumPy's high level API from default implementations on > ``numpy.ndarray`` objects. > > The downsides are that this would require an explicit opt-in from all > existing code, e.g., ``import numpy.api as np``, and in the long term > would result in the maintainence of two separate NumPy APIs. Also, many > functions from ``numpy`` itself are already overloaded (but > inadequately), so confusion about high vs. low level APIs in NumPy would > still persist. > > Multiple dispatch > ~~~~~~~~~~~~~~~~~ > > An alternative to our suggestion of the ``__array_function__`` protocol > would be implementing NumPy's core functions as > `multi-methods `_. > Although one of us wrote a `multiple dispatch > library `_ for Python, we > don't think this approach makes sense for NumPy in the near term. > > The main reason is that NumPy already has a well-proven dispatching > mechanism with ``__array_ufunc__``, based on Python's own dispatching > system for arithemtic, and it would be confusing to add another > mechanism that works in a very different way. This would also be more > invasive change to NumPy itself, which would need to gain a multiple > dispatch implementation. > > It is possible that multiple dispatch implementation for NumPy's high > level API could make sense in the future. Fortunately, > ``__array_function__`` does not preclude this possibility, because it > would be straightforward to write a shim for a default > ``__array_function__`` implementation in terms of multiple dispatch. > > Implementations in terms of a limited core API > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ > > The internal implemenations of some NumPy functions is extremely simple. > For example: - ``np.stack()`` is implemented in only a few lines of code > by combining indexing with ``np.newaxis``, ``np.concatenate`` and the > ``shape`` attribute. - ``np.mean()`` is implemented internally in terms > of ``np.sum()``, ``np.divide()``, ``.astype()`` and ``.shape``. > > This suggests the possibility of defining a minimal "core" ndarray > interface, and relying upon it internally in NumPy to implement the full > API. This is an attractive option, because it could significantly reduce > the work required for new array implementations. > > However, this also comes with several downsides: 1. The details of how > NumPy implements a high-level function in terms of overloaded functions > now becomes an implicit part of NumPy's public API. For example, > refactoring ``stack`` to use ``np.block()`` instead of > ``np.concatenate()`` internally would now become a breaking change. 2. > Array libraries may prefer to implement high level functions differently > than NumPy. 
For example, a library might prefer to implement a > fundamental operations like ``mean()`` directly rather than relying on > ``sum()`` followed by division. More generally, it's not clear yet what > exactly qualifies as core functionality, and figuring this out could be > a large project. 3. We don't yet have an overloading system for > attributes and methods on array objects, e.g., for accessing ``.dtype`` > and ``.shape``. This should be the subject of a future NEP, but until > then we should be reluctant to rely on these properties. > > Given these concerns, we encourage relying on this approach only in > limited cases. > > Coersion to a NumPy array as a catch-all fallback > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ > > With the current design, classes that implement ``__array_function__`` > to overload at least one function implicitly declare an intent to > implement the entire NumPy API. It's not possible to implement *only* > ``np.concatenate()`` on a type, but fall back to NumPy's default > behavior of casting with ``np.asarray()`` for all other functions. > > This could present a backwards compatibility concern that would > discourage libraries from adopting ``__array_function__`` in an > incremental fashion. For example, currently most numpy functions will > implicitly convert ``pandas.Series`` objects into NumPy arrays, behavior > that assuredly many pandas users rely on. If pandas implemented > ``__array_function__`` only for ``np.concatenate``, unrelated NumPy > functions like ``np.nanmean`` would suddenly break on pandas objects by > raising TypeError. > > With ``__array_ufunc__``, it's possible to alleviate this concern by > casting all arguments to numpy arrays and re-calling the ufunc, but the > heterogeneous function signatures supported by ``__array_function__`` > make it impossible to implement this generic fallback behavior for > ``__array_function__``. > > We could resolve this issue by change the handling of return values in > ``__array_function__`` in either of two possible ways: 1. Change the > meaning of all arguments returning ``NotImplemented`` to indicate that > all arguments should be coerced to NumPy arrays instead. However, many > array libraries (e.g., scipy.sparse) really don't want implicit > conversions to NumPy arrays, and often avoid implementing ``__array__`` > for exactly this reason. Implicit conversions can result in silent bugs > and performance degradation. 2. Use another sentinel value of some sort > to indicate that a class implementing part of the higher level array API > is coercible as a fallback, e.g., a return value of > ``np.NotImplementedButCoercible`` from ``__array_function__``. > > If we take this second approach, we would need to define additional > rules for how coercible array arguments are coerced, e.g., - Would we > try for ``__array_function__`` overloads again after coercing coercible > arguments? - If so, would we coerce coercible arguments one-at-a-time, > or all-at-once? > > These are slightly tricky design questions, so for now we propose to > defer this issue. We can always implement > ``np.NotImplementedButCoercible`` at some later time if it proves > critical to the numpy community in the future. Importantly, we don't > think this will stop critical libraries that desire to implement most of > the high level NumPy API from adopting this proposal. > > NOTE: If you are reading this NEP in its draft state and disagree, > please speak up on the mailing list! 
> > Drawbacks of this approach > -------------------------- > > Future difficulty extending NumPy's API > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ > > One downside of passing on all arguments directly on to > ``__array_function__`` is that it makes it hard to extend the signatures > of overloaded NumPy functions with new arguments, because adding even an > optional keyword argument would break existing overloads. > > This is not a new problem for NumPy. NumPy has occasionally changed the > signature for functions in the past, including functions like > ``numpy.sum`` which support overloads. > > For adding new keyword arguments that do not change default behavior, we > would only include these as keyword arguments when they have changed > from default values. This is similar to `what NumPy already has > done < > https://github.com/numpy/numpy/blob/v1.14.2/numpy/core/fromnumeric.py#L1865-L1867 > >`_, > e.g., for the optional ``keepdims`` argument in ``sum``: > > .. code:: python > > def sum(array, ..., keepdims=np._NoValue): > kwargs = {} > if keepdims is not np._NoValue: > kwargs['keepdims'] = keepdims > return array.sum(..., **kwargs) > > In other cases, such as deprecated arguments, preserving the existing > behavior of overloaded functions may not be possible. Libraries that use > ``__array_function__`` should be aware of this risk: we don't propose to > freeze NumPy's API in stone any more than it already is. > > Difficulty adding implementation specific arguments > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ > > Some array implementations generally follow NumPy's API, but have > additional optional keyword arguments (e.g., ``dask.array.sum()`` has > ``split_every`` and ``tensorflow.reduce_sum()`` has ``name``). A generic > dispatching library could potentially pass on all unrecognized keyword > argument directly to the implementation, but extending ``np.sum()`` to > pass on ``**kwargs`` would entail public facing changes in NumPy. > Customizing the detailed behavior of array libraries will require using > library specific functions, which could be limiting in the case of > libraries that consume the NumPy API such as xarray. > > > Discussion > ---------- > > Various alternatives to this proposal were discussed in a few Github > issues: > > 1. `pydata/sparse #1 `_ > 2. `numpy/numpy #11129 `_ > > Additionally it was the subject of `a blogpost > `_ Following > this > it was discussed at a `NumPy developer sprint > `_ at the `UC > Berkeley Institute for Data Science (BIDS) `_. > > > References and Footnotes > ------------------------ > > .. [1] Each NEP must either be explicitly labeled as placed in the public > domain (see > this NEP as an example) or licensed under the `Open Publication > License`_. > > .. _Open Publication License: http://www.opencontent.org/openpub/ > > > Copyright > --------- > > This document has been placed in the public domain. [1]_ > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at python.org > https://mail.python.org/mailman/listinfo/numpy-discussion > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: 

From einstein.edison at gmail.com Sun Jun 3 00:45:55 2018
From: einstein.edison at gmail.com (Hameer Abbasi)
Date: Sat, 2 Jun 2018 21:45:55 -0700
Subject: [Numpy-discussion] =?utf-8?q?NEP=3A_Dispatch_Mechanism_for_NumPy?=
 =?utf-8?q?=E2=80=99s_high_level_API?=
In-Reply-To: 
References: 
Message-ID: 

Perhaps I missed this but I didn't see: what happens when both
__array_ufunc__ and __array_function__ are defined? I might want to do
this to, for example, add support for functions like concatenate or stack
to a class that already has an __array_ufunc__ defined.

This is mentioned in the section "Non-goals", which says that ufuncs and
their methods should be excluded, along with a few other classes of
functions/methods.

Sent from Astro for Mac

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From m.h.vankerkwijk at gmail.com Sun Jun 3 11:19:01 2018
From: m.h.vankerkwijk at gmail.com (Marten van Kerkwijk)
Date: Sun, 3 Jun 2018 11:19:01 -0400
Subject: [Numpy-discussion] =?utf-8?q?NEP=3A_Dispatch_Mechanism_for_NumPy?=
 =?utf-8?q?=E2=80=99s_high_level_API?=
In-Reply-To: 
References: 
Message-ID: 

Hi Stephan,

Thanks for posting. Overall, this is great!

My more general comment is one of speed: for *normal* operation
performance should be impacted as minimally as possible. I think this is a
serious issue and feel strongly it *has* to be possible to avoid all
arguments being checked for the `__array_function__` attribute, i.e.,
there should be an obvious way to ensure no type checking dance is done.

Some possible solutions (which I think should be in the NEP, even if as
discounted options):

A. Two "namespaces", one for the undecorated base functions, and one
completely trivial one for the decorated ones. The idea would be that if
one knows one is dealing with arrays only, one would do `import
numpy.array_only as np` (i.e., the reverse of the suggestion currently in
the NEP, where the decorated ones are in their own namespace - I agree
with the reasons for discounting that one). Note that in this suggestion
the array-only namespace serves as the one used for
`ndarray.__array_function__`.

B. Automatic insertion by the decorator of an `array_only=np._NoValue` (or
`coerce` and perhaps `subok=...` if not present) in the function
signature, so that users who know that they have arrays only could pass
`array_only=True` (name to be decided). This would be most useful if there
were also some type of configuration parameter that could set the default
of `array_only`.

Note that both A and B could also address, at least partially, the problem
of sometimes wanting to just use the old coercion methods, i.e., not
having to implement every possible numpy function in one go in a new
`__array_function__` on one's class.

Two other general comments:

1. I'm rather unclear about the use of `types`. It can help me decide what
to do, but I would still have to find the argument in question (e.g., for
Quantity, the unit of the relevant argument). I'd recommend passing
instead a tuple of all arguments that were inspected, in the inspection
order; after all, it is just an `arg.__class__` away from the type, and in
your example you'd only have to replace `issubclass` by `isinstance`.

2. For subclasses, it would be very handy to have
`ndarray.__array_function__`, so one can call super after changing
arguments. (For `__array_ufunc__`, there was lots of question about
whether this was useful, but it really is!!).
[I think you already agreed with this, but want to have it in-place, as for subclasses of ndarray this is just as useful as it would be for subclasses of dask arrays.) Note that any `ndarray.__array_function__` might also help solve the problem of cases where coercion is fine: it could have an extra keyword argument (say `coerce`) that would call the function with coercion in place. Indeed, if the `ndarray.__array_function__` were used inside the "dance" function, and then the actual implementation of a given function would just be a separate, private one. Again, overall a great idea, and thanks to all those involved for taking it on. All the best, Marten On Sat, Jun 2, 2018 at 6:55 PM, Stephan Hoyer wrote: > Matthew Rocklin and I have written NEP-18, which proposes a new dispatch > mechanism for NumPy's high level API: http://www.numpy.org/neps/nep- > 0018-array-function-protocol.html > > There has already been a little bit of scattered discussion on the pull > request (https://github.com/numpy/numpy/pull/11189), but per NEP-0 let's > try to keep high-level discussion here on the mailing list. > > The full text of the NEP is reproduced below: > > ================================================== > NEP: Dispatch Mechanism for NumPy's high level API > ================================================== > > :Author: Stephan Hoyer > :Author: Matthew Rocklin > :Status: Draft > :Type: Standards Track > :Created: 2018-05-29 > > Abstact > ------- > > We propose a protocol to allow arguments of numpy functions to define > how that function operates on them. This allows other libraries that > implement NumPy's high level API to reuse Numpy functions. This allows > libraries that extend NumPy's high level API to apply to more NumPy-like > libraries. > > Detailed description > -------------------- > > Numpy's high level ndarray API has been implemented several times > outside of NumPy itself for different architectures, such as for GPU > arrays (CuPy), Sparse arrays (scipy.sparse, pydata/sparse) and parallel > arrays (Dask array) as well as various Numpy-like implementations in the > deep learning frameworks, like TensorFlow and PyTorch. > > Similarly there are several projects that build on top of the Numpy API > for labeled and indexed arrays (XArray), automatic differentation > (Autograd, Tangent), higher order array factorizations (TensorLy), etc. > that add additional functionality on top of the Numpy API. > > We would like to be able to use these libraries together, for example we > would like to be able to place a CuPy array within XArray, or perform > automatic differentiation on Dask array code. This would be easier to > accomplish if code written for NumPy ndarrays could also be used by > other NumPy-like projects. > > For example, we would like for the following code example to work > equally well with any Numpy-like array object: > > .. code:: python > > def f(x): > y = np.tensordot(x, x.T) > return np.mean(np.exp(y)) > > Some of this is possible today with various protocol mechanisms within > Numpy. > > - The ``np.exp`` function checks the ``__array_ufunc__`` protocol > - The ``.T`` method works using Python's method dispatch > - The ``np.mean`` function explicitly checks for a ``.mean`` method on > the argument > > However other functions, like ``np.tensordot`` do not dispatch, and > instead are likely to coerce to a Numpy array (using the ``__array__``) > protocol, or err outright. 
To achieve enough coverage of the NumPy API > to support downstream projects like XArray and autograd we want to > support *almost all* functions within Numpy, which calls for a more > reaching protocol than just ``__array_ufunc__``. We would like a > protocol that allows arguments of a NumPy function to take control and > divert execution to another function (for example a GPU or parallel > implementation) in a way that is safe and consistent across projects. > > Implementation > -------------- > > We propose adding support for a new protocol in NumPy, > ``__array_function__``. > > This protocol is intended to be a catch-all for NumPy functionality that > is not covered by existing protocols, like reductions (like ``np.sum``) > or universal functions (like ``np.exp``). The semantics are very similar > to ``__array_ufunc__``, except the operation is specified by an > arbitrary callable object rather than a ufunc instance and method. > > The interface > ~~~~~~~~~~~~~ > > We propose the following signature for implementations of > ``__array_function__``: > > .. code-block:: python > > def __array_function__(self, func, types, args, kwargs) > > - ``func`` is an arbitrary callable exposed by NumPy's public API, > which was called in the form ``func(*args, **kwargs)``. > - ``types`` is a list of types for all arguments to the original NumPy > function call that will be checked for an ``__array_function__`` > implementation. > - The tuple ``args`` and dict ``**kwargs`` are directly passed on from the > original call. > > Unlike ``__array_ufunc__``, there are no high-level guarantees about the > type of ``func``, or about which of ``args`` and ``kwargs`` may contain > objects > implementing the array API. As a convenience for ``__array_function__`` > implementors of the NumPy API, the ``types`` keyword contains a list of all > types that implement the ``__array_function__`` protocol. This allows > downstream implementations to quickly determine if they are likely able to > support the operation. > > Still be determined: what guarantees can we offer for ``types``? Should > we promise that types are unique, and appear in the order in which they > are checked? > > Example for a project implementing the NumPy API > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ > > Most implementations of ``__array_function__`` will start with two > checks: > > 1. Is the given function something that we know how to overload? > 2. Are all arguments of a type that we know how to handle? > > If these conditions hold, ``__array_function__`` should return > the result from calling its implementation for ``func(*args, **kwargs)``. > Otherwise, it should return the sentinel value ``NotImplemented``, > indicating > that the function is not implemented by these types. > > .. code:: python > > class MyArray: > def __array_function__(self, func, types, args, kwargs): > if func not in HANDLED_FUNCTIONS: > return NotImplemented > if not all(issubclass(t, MyArray) for t in types): > return NotImplemented > return HANDLED_FUNCTIONS[func](*args, **kwargs) > > HANDLED_FUNCTIONS = { > np.concatenate: my_concatenate, > np.broadcast_to: my_broadcast_to, > np.sum: my_sum, > ... > } > > Necessary changes within the Numpy codebase itself > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ > > This will require two changes within the Numpy codebase: > > 1. A function to inspect available inputs, look for the > ``__array_function__`` attribute on those inputs, and call those > methods appropriately until one succeeds. 
This needs to be fast in the > common all-NumPy case. > > This is one additional function of moderate complexity. > 2. Calling this function within all relevant Numpy functions. > > This affects many parts of the Numpy codebase, although with very low > complexity. > > Finding and calling the right ``__array_function__`` > ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ > > Given a Numpy function, ``*args`` and ``**kwargs`` inputs, we need to > search through ``*args`` and ``**kwargs`` for all appropriate inputs > that might have the ``__array_function__`` attribute. Then we need to > select among those possible methods and execute the right one. > Negotiating between several possible implementations can be complex. > > Finding arguments > ''''''''''''''''' > > Valid arguments may be directly in the ``*args`` and ``**kwargs``, such > as in the case for ``np.tensordot(left, right, out=out)``, or they may > be nested within lists or dictionaries, such as in the case of > ``np.concatenate([x, y, z])``. This can be problematic for two reasons: > > 1. Some functions are given long lists of values, and traversing them > might be prohibitively expensive > 2. Some function may have arguments that we don't want to inspect, even > if they have the ``__array_function__`` method > > To resolve these we ask the functions to provide an explicit list of > arguments that should be traversed. This is the ``relevant_arguments=`` > keyword in the examples below. > > Trying ``__array_function__`` methods until the right one works > ''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''' > > Many arguments may implement the ``__array_function__`` protocol. Some > of these may decide that, given the available inputs, they are unable to > determine the correct result. How do we call the right one? If several > are valid then which has precedence? > > The rules for dispatch with ``__array_function__`` match those for > ``__array_ufunc__`` (see > `NEP-13 `_). > In particular: > > - NumPy will gather implementations of ``__array_function__`` from all > specified inputs and call them in order: subclasses before > superclasses, and otherwise left to right. Note that in some edge cases, > this differs slightly from the > `current behavior `_ of Python. > - Implementations of ``__array_function__`` indicate that they can > handle the operation by returning any value other than > ``NotImplemented``. > - If all ``__array_function__`` methods return ``NotImplemented``, > NumPy will raise ``TypeError``. > > Changes within Numpy functions > ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ > > Given a function defined above, for now call it > ``do_array_function_dance``, we now need to call that function from > within every relevant Numpy function. This is a pervasive change, but of > fairly simple and innocuous code that should complete quickly and > without effect if no arguments implement the ``__array_function__`` > protocol. Let us consider a few examples of NumPy functions and how they > might be affected by this change: > > .. code:: python > > def broadcast_to(array, shape, subok=False): > success, value = do_array_function_dance( > func=broadcast_to, > relevant_arguments=[array], > args=(array,), > kwargs=dict(shape=shape, subok=subok)) > if success: > return value > > ... 
# continue with the definition of broadcast_to > > def concatenate(arrays, axis=0, out=None) > success, value = do_array_function_dance( > func=concatenate, > relevant_arguments=[arrays, out], > args=(arrays,), > kwargs=dict(axis=axis, out=out)) > if success: > return value > > ... # continue with the definition of concatenate > > The list of objects passed to ``relevant_arguments`` are those that should > be inspected for ``__array_function__`` implementations. > > Alternatively, we could write these overloads with a decorator, e.g., > > .. code:: python > > @overload_for_array_function(['array']) > def broadcast_to(array, shape, subok=False): > ... # continue with the definition of broadcast_to > > @overload_for_array_function(['arrays', 'out']) > def concatenate(arrays, axis=0, out=None): > ... # continue with the definition of concatenate > > The decorator ``overload_for_array_function`` would be written in terms > of ``do_array_function_dance``. > > The downside of this approach would be a loss of introspection capability > for NumPy functions on Python 2, since this requires the use of > ``inspect.Signature`` (only available on Python 3). However, NumPy won't > be supporting Python 2 for `very much longer nep-0014-dropping-python2.7-proposal.html>`_. > > Use outside of NumPy > ~~~~~~~~~~~~~~~~~~~~ > > Nothing about this protocol that is particular to NumPy itself. Should > we enourage use of the same ``__array_function__`` protocol third-party > libraries for overloading non-NumPy functions, e.g., for making > array-implementation generic functionality in SciPy? > > This would offer significant advantages (SciPy wouldn't need to invent > its own dispatch system) and no downsides that we can think of, because > every function that dispatches with ``__array_function__`` already needs > to be explicitly recognized. Libraries like Dask, CuPy, and Autograd > already wrap a limited subset of SciPy functionality (e.g., > ``scipy.linalg``) similarly to how they wrap NumPy. > > If we want to do this, we should consider exposing the helper function > ``do_array_function_dance()`` above as a public API. > > Non-goals > --------- > > We are aiming for basic strategy that can be relatively mechanistically > applied to almost all functions in NumPy's API in a relatively short > period of time, the development cycle of a single NumPy release. > > We hope to get both the ``__array_function__`` protocol and all specific > overloads right on the first try, but our explicit aim here is to get > something that mostly works (and can be iterated upon), rather than to > wait for an optimal implementation. The price of moving fast is that for > now **this protocol should be considered strictly experimental**. We > reserve the right to change the details of this protocol and how > specific NumPy functions use it at any time in the future -- even in > otherwise bug-fix only releases of NumPy. > > In particular, we don't plan to write additional NEPs that list all > specific functions to overload, with exactly how they should be > overloaded. We will leave this up to the discretion of committers on > individual pull requests, trusting that they will surface any > controversies for discussion by interested parties. > > However, we already know several families of functions that should be > explicitly exclude from ``__array_function__``. These will need their > own protocols: > > - universal functions, which already have their own protocol. 
> - ``array`` and ``asarray``, because they are explicitly intended for > coercion to actual ``numpy.ndarray`` object. > - dispatch for methods of any kind, e.g., methods on > ``np.random.RandomState`` objects. > > As a concrete example of how we expect to break behavior in the future, > some functions such as ``np.where`` are currently not NumPy universal > functions, but conceivably could become universal functions in the > future. When/if this happens, we will change such overloads from using > ``__array_function__`` to the more specialized ``__array_ufunc__``. > > > Backward compatibility > ---------------------- > > This proposal does not change existing semantics, except for those > arguments > that currently have ``__array_function__`` methods, which should be rare. > > > Alternatives > ------------ > > Specialized protocols > ~~~~~~~~~~~~~~~~~~~~~ > > We could (and should) continue to develop protocols like > ``__array_ufunc__`` for cohesive subsets of Numpy functionality. > > As mentioned above, if this means that some functions that we overload > with ``__array_function__`` should switch to a new protocol instead, > that is explicitly OK for as long as ``__array_function__`` retains its > experimental status. > > Separate namespace > ~~~~~~~~~~~~~~~~~~ > > A separate namespace for overloaded functions is another possibility, > either inside or outside of NumPy. > > This has the advantage of alleviating any possible concerns about > backwards compatibility and would provide the maximum freedom for quick > experimentation. In the long term, it would provide a clean abstration > layer, separating NumPy's high level API from default implementations on > ``numpy.ndarray`` objects. > > The downsides are that this would require an explicit opt-in from all > existing code, e.g., ``import numpy.api as np``, and in the long term > would result in the maintainence of two separate NumPy APIs. Also, many > functions from ``numpy`` itself are already overloaded (but > inadequately), so confusion about high vs. low level APIs in NumPy would > still persist. > > Multiple dispatch > ~~~~~~~~~~~~~~~~~ > > An alternative to our suggestion of the ``__array_function__`` protocol > would be implementing NumPy's core functions as > `multi-methods `_. > Although one of us wrote a `multiple dispatch > library `_ for Python, we > don't think this approach makes sense for NumPy in the near term. > > The main reason is that NumPy already has a well-proven dispatching > mechanism with ``__array_ufunc__``, based on Python's own dispatching > system for arithemtic, and it would be confusing to add another > mechanism that works in a very different way. This would also be more > invasive change to NumPy itself, which would need to gain a multiple > dispatch implementation. > > It is possible that multiple dispatch implementation for NumPy's high > level API could make sense in the future. Fortunately, > ``__array_function__`` does not preclude this possibility, because it > would be straightforward to write a shim for a default > ``__array_function__`` implementation in terms of multiple dispatch. > > Implementations in terms of a limited core API > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ > > The internal implemenations of some NumPy functions is extremely simple. > For example: - ``np.stack()`` is implemented in only a few lines of code > by combining indexing with ``np.newaxis``, ``np.concatenate`` and the > ``shape`` attribute. 
> - ``np.mean()`` is implemented internally in terms of ``np.sum()``,
>   ``np.divide()``, ``.astype()`` and ``.shape``.
>
> This suggests the possibility of defining a minimal "core" ndarray
> interface, and relying upon it internally in NumPy to implement the full
> API. This is an attractive option, because it could significantly reduce
> the work required for new array implementations.
>
> However, this also comes with several downsides:
>
> 1. The details of how NumPy implements a high-level function in terms of
>    overloaded functions now become an implicit part of NumPy's public
>    API. For example, refactoring ``stack`` to use ``np.block()`` instead
>    of ``np.concatenate()`` internally would now become a breaking change.
> 2. Array libraries may prefer to implement high level functions
>    differently than NumPy. For example, a library might prefer to
>    implement a fundamental operation like ``mean()`` directly rather
>    than relying on ``sum()`` followed by division. More generally, it's
>    not clear yet what exactly qualifies as core functionality, and
>    figuring this out could be a large project.
> 3. We don't yet have an overloading system for attributes and methods on
>    array objects, e.g., for accessing ``.dtype`` and ``.shape``. This
>    should be the subject of a future NEP, but until then we should be
>    reluctant to rely on these properties.
>
> Given these concerns, we encourage relying on this approach only in
> limited cases.
>
> Coercion to a NumPy array as a catch-all fallback
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>
> With the current design, classes that implement ``__array_function__``
> to overload at least one function implicitly declare an intent to
> implement the entire NumPy API. It's not possible to implement *only*
> ``np.concatenate()`` on a type while falling back to NumPy's default
> behavior of casting with ``np.asarray()`` for all other functions.
>
> This could present a backwards compatibility concern that would
> discourage libraries from adopting ``__array_function__`` in an
> incremental fashion. For example, currently most numpy functions will
> implicitly convert ``pandas.Series`` objects into NumPy arrays, behavior
> that assuredly many pandas users rely on. If pandas implemented
> ``__array_function__`` only for ``np.concatenate``, unrelated NumPy
> functions like ``np.nanmean`` would suddenly break on pandas objects by
> raising TypeError.
>
> With ``__array_ufunc__``, it's possible to alleviate this concern by
> casting all arguments to numpy arrays and re-calling the ufunc, but the
> heterogeneous function signatures supported by ``__array_function__``
> make it impossible to implement this generic fallback behavior for
> ``__array_function__``.
>
> We could resolve this issue by changing the handling of return values in
> ``__array_function__`` in either of two possible ways:
>
> 1. Change the meaning of all arguments returning ``NotImplemented`` to
>    indicate that all arguments should be coerced to NumPy arrays
>    instead. However, many array libraries (e.g., scipy.sparse) really
>    don't want implicit conversions to NumPy arrays, and often avoid
>    implementing ``__array__`` for exactly this reason. Implicit
>    conversions can result in silent bugs and performance degradation.
> 2. Use another sentinel value of some sort to indicate that a class
>    implementing part of the higher level array API is coercible as a
>    fallback, e.g., a return value of ``np.NotImplementedButCoercible``
>    from ``__array_function__``.
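As an illustration of what option 2 would enable, a library that only cares about ``np.concatenate`` might do something like the following sketch. Everything here is hypothetical: the class is invented, and ``np.NotImplementedButCoercible`` does not exist yet, so the snippet is illustrative only:

    import numpy as np

    class MySeries:
        # Hypothetical array-like container that overrides only np.concatenate
        # and opts into NumPy's default coercion for every other function.
        def __init__(self, data):
            self.data = np.asarray(data)

        def __array__(self):
            # Lets np.asarray() coerce this object whenever NumPy falls back.
            return self.data

        def __array_function__(self, func, types, args, kwargs):
            if func is not np.concatenate:
                # Proposed sentinel (does not exist yet): "I do not handle
                # this function, but feel free to coerce me with np.asarray()."
                return np.NotImplementedButCoercible
            # Cheap check on the unique argument types rather than on every
            # single argument.
            if not all(issubclass(t, (MySeries, np.ndarray)) for t in types):
                return NotImplemented
            # np.concatenate takes a sequence of arrays as its first argument.
            arrays = [np.asarray(a) for a in args[0]]
            return MySeries(np.concatenate(arrays, **kwargs))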
>
> If we take this second approach, we would need to define additional
> rules for how coercible array arguments are coerced, e.g.,
>
> - Would we try for ``__array_function__`` overloads again after coercing
>   coercible arguments?
> - If so, would we coerce coercible arguments one-at-a-time, or
>   all-at-once?
>
> These are slightly tricky design questions, so for now we propose to
> defer this issue. We can always implement
> ``np.NotImplementedButCoercible`` at some later time if it proves
> critical to the numpy community in the future. Importantly, we don't
> think this will stop critical libraries that desire to implement most of
> the high level NumPy API from adopting this proposal.
>
> NOTE: If you are reading this NEP in its draft state and disagree,
> please speak up on the mailing list!
>
> Drawbacks of this approach
> --------------------------
>
> Future difficulty extending NumPy's API
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>
> One downside of passing on all arguments directly on to
> ``__array_function__`` is that it makes it hard to extend the signatures
> of overloaded NumPy functions with new arguments, because adding even an
> optional keyword argument would break existing overloads.
>
> This is not a new problem for NumPy. NumPy has occasionally changed the
> signature for functions in the past, including functions like
> ``numpy.sum`` which support overloads.
>
> For adding new keyword arguments that do not change default behavior, we
> would only include these as keyword arguments when they have changed
> from default values. This is similar to `what NumPy already has
> done fromnumeric.py#L1865-L1867>`_, e.g., for the optional ``keepdims``
> argument in ``sum``:
>
> .. code:: python
>
>     def sum(array, ..., keepdims=np._NoValue):
>         kwargs = {}
>         if keepdims is not np._NoValue:
>             kwargs['keepdims'] = keepdims
>         return array.sum(..., **kwargs)
>
> In other cases, such as deprecated arguments, preserving the existing
> behavior of overloaded functions may not be possible. Libraries that use
> ``__array_function__`` should be aware of this risk: we don't propose to
> freeze NumPy's API in stone any more than it already is.
>
> Difficulty adding implementation specific arguments
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>
> Some array implementations generally follow NumPy's API, but have
> additional optional keyword arguments (e.g., ``dask.array.sum()`` has
> ``split_every`` and ``tensorflow.reduce_sum()`` has ``name``). A generic
> dispatching library could potentially pass on all unrecognized keyword
> arguments directly to the implementation, but extending ``np.sum()`` to
> pass on ``**kwargs`` would entail public facing changes in NumPy.
> Customizing the detailed behavior of array libraries will require using
> library specific functions, which could be limiting in the case of
> libraries that consume the NumPy API such as xarray.
>
>
> Discussion
> ----------
>
> Various alternatives to this proposal were discussed in a few Github
> issues:
>
> 1. `pydata/sparse #1 `_
> 2. `numpy/numpy #11129 `_
>
> Additionally it was the subject of `a blogpost `_. Following this,
> it was discussed at a `NumPy developer sprint `_ at the
> `UC Berkeley Institute for Data Science (BIDS) `_.
>
>
> References and Footnotes
> ------------------------
>
> .. [1] Each NEP must either be explicitly labeled as placed in the public
>    domain (see this NEP as an example) or licensed under the
>    `Open Publication License`_.
>
> ..
_Open Publication License: http://www.opencontent.org/openpub/ > > > Copyright > --------- > > This document has been placed in the public domain. [1]_ > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at python.org > https://mail.python.org/mailman/listinfo/numpy-discussion > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From einstein.edison at gmail.com Sun Jun 3 14:00:32 2018 From: einstein.edison at gmail.com (Hameer Abbasi) Date: Sun, 3 Jun 2018 11:00:32 -0700 Subject: [Numpy-discussion] =?utf-8?q?NEP=3A_Dispatch_Mechanism_for_NumPy?= =?utf-8?q?=E2=80=99s_high_level_API?= In-Reply-To: References: Message-ID: The rules for dispatch with ``__array_function__`` match those for ``__array_ufunc__`` (see `NEP-13 `_). In particular: - NumPy will gather implementations of ``__array_function__`` from all specified inputs and call them in order: subclasses before superclasses, and otherwise left to right. Note that in some edge cases, this differs slightly from the `current behavior `_ of Python. - Implementations of ``__array_function__`` indicate that they can handle the operation by returning any value other than ``NotImplemented``. - If all ``__array_function__`` methods return ``NotImplemented``, NumPy will raise ``TypeError``. I?d like to propose two changes to this: - ``np.NotImplementedButCoercible`` be a part of the standard from the start. - If all implementations return this, only then should it be coerced. - In the future, it might be good to mark something as coercible to coerce it to ``ndarray`` before passing to another object?s ``__array_ufunc__``. - This is necessary if libraries want to keep old behaviour for some functions, while overriding others. - Otherwise they have to implement overloads for all functions. This seems rather like an all-or-nothing choice, which I?d like to avoid. - It isn?t too hard to implement in practice. - Objects that don?t implement ``__array_function__`` should be treated as having returned ``np.NotImplementedButCoercible``. - This has the effect of coercing ``list``, etc. - At a minimum, to maintain compatibility, if all objects don?t implement ``__array_function__``, the old behaviour should stay. Also, I?m +1 on Marten?s suggestion that ``ndarray`` itself should implement ``__array_function__``. -------------- next part -------------- An HTML attachment was scrubbed... URL: From m.h.vankerkwijk at gmail.com Sun Jun 3 14:10:37 2018 From: m.h.vankerkwijk at gmail.com (Marten van Kerkwijk) Date: Sun, 3 Jun 2018 14:10:37 -0400 Subject: [Numpy-discussion] =?utf-8?q?NEP=3A_Dispatch_Mechanism_for_NumPy?= =?utf-8?q?=E2=80=99s_high_level_API?= In-Reply-To: References: Message-ID: On Sun, Jun 3, 2018 at 2:00 PM, Hameer Abbasi wrote: > The rules for dispatch with ``__array_function__`` match those for > ``__array_ufunc__`` (see > `NEP-13 `_). > In particular: > > - NumPy will gather implementations of ``__array_function__`` from all > specified inputs and call them in order: subclasses before > superclasses, and otherwise left to right. Note that in some edge cases, > this differs slightly from the > `current behavior `_ of Python. > - Implementations of ``__array_function__`` indicate that they can > handle the operation by returning any value other than > ``NotImplemented``. > - If all ``__array_function__`` methods return ``NotImplemented``, > NumPy will raise ``TypeError``. 
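For concreteness, the gathering and calling described in the rules above could be sketched in pure Python along these lines (the helper name and return convention are made up for illustration; this is not the implementation proposed in the NEP):

    def try_array_function_override(func, relevant_args, args, kwargs):
        # Collect one candidate per unique type that defines
        # __array_function__, placing subclasses ahead of their superclasses
        # and otherwise keeping left-to-right argument order.
        overloaded_args = []
        for arg in relevant_args:
            if not hasattr(type(arg), '__array_function__'):
                continue
            if any(type(arg) is type(seen) for seen in overloaded_args):
                continue
            index = len(overloaded_args)
            for i, seen in enumerate(overloaded_args):
                if issubclass(type(arg), type(seen)):
                    index = i
                    break
            overloaded_args.insert(index, arg)

        types = tuple(type(arg) for arg in overloaded_args)
        for arg in overloaded_args:
            result = arg.__array_function__(func, types, args, kwargs)
            if result is not NotImplemented:
                return True, result
        # Every candidate declined (or there were none): the caller raises
        # TypeError or falls back to coercion, as discussed in this thread.
        return False, None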
> > > I?d like to propose two changes to this: > > - ``np.NotImplementedButCoercible`` be a part of the standard from the > start. > - If all implementations return this, only then should it be > coerced. > - In the future, it might be good to mark something as coercible > to coerce it to ``ndarray`` before passing to another object?s > ``__array_ufunc__``. > - This is necessary if libraries want to keep old behaviour for > some functions, while overriding others. > - Otherwise they have to implement overloads for all functions. > This seems rather like an all-or-nothing choice, which I?d like to avoid. > - It isn?t too hard to implement in practice. > > I think the issue is real but I would be slightly worried about adding multiple possible things to return - there is a benefit to an answer being either "I cannot do this" or "here's the result". I also am not sure there is an actual problem: In the scheme as proposed, implementations could just coerce themselves to array and call the routine again. (Or, in the scheme I proposed, call the routine again but with `coerce=True`.) > > - Objects that don?t implement ``__array_function__`` should be > treated as having returned ``np.NotImplementedButCoercible``. > - This has the effect of coercing ``list``, etc. > - At a minimum, to maintain compatibility, if all objects don?t > implement ``__array_function__``, the old behaviour should stay. > > I think that in the proposed scheme this is effectively what happens. -- Marten -------------- next part -------------- An HTML attachment was scrubbed... URL: From einstein.edison at gmail.com Sun Jun 3 14:52:43 2018 From: einstein.edison at gmail.com (Hameer Abbasi) Date: Sun, 3 Jun 2018 11:52:43 -0700 Subject: [Numpy-discussion] =?utf-8?q?NEP=3A_Dispatch_Mechanism_for_NumPy?= =?utf-8?q?=E2=80=99s_high_level_API?= In-Reply-To: References: Message-ID: I also am not sure there is an actual problem: In the scheme as proposed, implementations could just coerce themselves to array and call the routine again. (Or, in the scheme I proposed, call the routine again but with `coerce=True`.) Ah, I didn?t think of the first solution. `coerce=True` may not produce the desired solution in cases where some arguments can be coerced and some can?t. However, such a design may still have some benefits. For example: - ``array1.HANDLED_TYPES = [array1]`` - ``array2.HANDLED_TYPES = [array1, array2]`` - ``array1`` is coercible. - None of these is a sub/super class of the other or of ``ndarray`` - When calling ``np.func(array1(), array2())``, ``array1`` would be coerced with your solution (because of the left-to-right rule and ``array1`` choosing to coerce itself) but not with ``np.NotImplementedButCoercible``. I think that in the proposed scheme this is effectively what happens. Not really, the current scheme is unclear on what happens if none of the arguments implement ``__array_function__`` (or at least it doesn?t explicitly state it that I can see). -------------- next part -------------- An HTML attachment was scrubbed... URL: From shoyer at gmail.com Sun Jun 3 19:00:08 2018 From: shoyer at gmail.com (Stephan Hoyer) Date: Sun, 3 Jun 2018 16:00:08 -0700 Subject: [Numpy-discussion] =?utf-8?q?NEP=3A_Dispatch_Mechanism_for_NumPy?= =?utf-8?q?=E2=80=99s_high_level_API?= In-Reply-To: References: Message-ID: On Sun, Jun 3, 2018 at 8:19 AM Marten van Kerkwijk < m.h.vankerkwijk at gmail.com> wrote: > My more general comment is one of speed: for *normal* operation > performance should be impacted as minimally as possible. 
I think this is a > serious issue and feel strongly it *has* to be possible to avoid all > arguments being checked for the `__array_function__` attribute, i.e., there > should be an obvious way to ensure no type checking dance is done. > I agree that all we should try minimize the impact of dispatching on normal operations. It would be helpful to identify examples of real workflows, so we can measure the impact of doing these checks empirically. That said, I think a small degradation in performance for code that works with small arrays should be acceptable, because performance is an already an accepted limitations of using NumPy/Python for these use cases. In most cases, I suspect that the overhead of a function call and checking several arguments for "__array_function__" will be negligible, like the situation for __array_ufunc__. I'm not strongly opposed to either of your proposed solutions, but I do think it would be a little strange to insist that we need a solution for __array_function__ when __array_ufunc__ was fine. > A. Two "namespaces", one for the undecorated base functions, and one > completely trivial one for the decorated ones. The idea would be that if > one knows one is dealing with arrays only, one would do `import > numpy.array_only as np` (i.e., the reverse of the suggestion currently in > the NEP, where the decorated ones are in their own namespace - I agree with > the reasons for discounting that one). > I will mention this as a possibility. I do think there is something to be said for clear separation of overloaded and non-overloaded APIs. But f I were to choose between adding numpy.api and numpy.array_only, I would pick numpy.api, because of the virtue of preserving the existing numpy namespace as it currently exists. > B. Automatic insertion by the decorator of an `array_only=np._NoValue` (or > `coerce` and perhaps `subok=...` if not present) in the function signature, > so that users who know that they have arrays only could pass > `array_only=True` (name to be decided). > Rather than adding another argument to every NumPy function, I would rather encourage writing np.asarray() explicitly. > Note that both A and B could also address, at least partially, the problem > of sometimes wanting to just use the old coercion methods, i.e., not having > to implement every possible numpy function in one go in a new > `__array_function__` on one's class. > Yes, agreed. > 1. I'm rather unclear about the use of `types`. It can help me decide what > to do, but I would still have to find the argument in question (e.g., for > Quantity, the unit of the relevant argument). I'd recommend passing instead > a tuple of all arguments that were inspected, in the inspection order; > after all, it is just a `arg.__class__` away from the type, and in your > example you'd only have to replace `issubclass` by `isinstance`. > The virtue of a `types` argument is that we can deduplicate arguments once, rather than in each __array_function__ check. This could result in significantly more efficient code, e.g,. when np.concatenate() is called on 10,000 arrays with only two unique types, we don't need to loop through all 10,000 again objects to check that overloading is valid. Even for Quantity, I suspect you will want two layers of checks: 1. A check to verify that every argument is a Quantity (or something coercible to a Quantity). This could use `types` and return `NotImplemented` when it fails. 2. A check to verify that units match. 
This will have custom logic for different operations and will require checking all arguments -- not just their unique types. For many Quantity functions, the second check will indeed probably be super simple (i.e., verifying that all units match). But the first check (with `types`) really is something that basically very overload should do. > 2. For subclasses, it would be very handy to have > `ndarray.__array_function__`, so one can call super after changing > arguments. (For `__array_ufunc__`, there was lots of question about whether > this was useful, but it really is!!). [I think you already agreed with > this, but want to have it in-place, as for subclasses of ndarray this is > just as useful as it would be for subclasses of dask arrays.) > Yes, indeed. -------------- next part -------------- An HTML attachment was scrubbed... URL: From warren.weckesser at gmail.com Sun Jun 3 19:09:41 2018 From: warren.weckesser at gmail.com (Warren Weckesser) Date: Sun, 3 Jun 2018 19:09:41 -0400 Subject: [Numpy-discussion] NEP: Random Number Generator Policy In-Reply-To: References: Message-ID: On Sat, Jun 2, 2018 at 3:04 PM, Robert Kern wrote: > As promised distressingly many months ago, I have written up a NEP about > relaxing the stream-compatibility policy that we currently have. > > https://github.com/numpy/numpy/pull/11229 > https://github.com/rkern/numpy/blob/nep/rng/doc/neps/ > nep-0019-rng-policy.rst > > I particularly invite comment on the two lists of methods that we still > would make strict compatibility guarantees for. > > --- > Thanks, Robert. It looks like you are neatly cutting the Gordian Knot of API versioning in numpy.random! I don't have any specific comments, except that it will be great to have *something* other than the status quo, so we can starting improving the existing numpy.random functions. Warren > ============================== > Random Number Generator Policy > ============================== > > :Author: Robert Kern > :Status: Draft > :Type: Standards Track > :Created: 2018-05-24 > > > Abstract > -------- > > For the past decade, NumPy has had a strict backwards compatibility policy > for > the number stream of all of its random number distributions. Unlike other > numerical components in ``numpy``, which are usually allowed to return > different when results when they are modified if they remain correct, we > have > obligated the random number distributions to always produce the exact same > numbers in every version. The objective of our stream-compatibility > guarantee > was to provide exact reproducibility for simulations across numpy versions > in > order to promote reproducible research. However, this policy has made it > very > difficult to enhance any of the distributions with faster or more accurate > algorithms. After a decade of experience and improvements in the > surrounding > ecosystem of scientific software, we believe that there are now better > ways to > achieve these objectives. We propose relaxing our strict > stream-compatibility > policy to remove the obstacles that are in the way of accepting > contributions > to our random number generation capabilities. > > > The Status Quo > -------------- > > Our current policy, in full: > > A fixed seed and a fixed series of calls to ``RandomState`` methods > using the > same parameters will always produce the same results up to roundoff > error > except when the values were incorrect. 
Incorrect values will be fixed > and > the NumPy version in which the fix was made will be noted in the > relevant > docstring. Extension of existing parameter ranges and the addition of > new > parameters is allowed as long the previous behavior remains unchanged. > > This policy was first instated in Nov 2008 (in essence; the full set of > weasel > words grew over time) in response to a user wanting to be sure that the > simulations that formed the basis of their scientific publication could be > reproduced years later, exactly, with whatever version of ``numpy`` that > was > current at the time. We were keen to support reproducible research, and > it was > still early in the life of ``numpy.random``. We had not seen much cause to > change the distribution methods all that much. > > We also had not thought very thoroughly about the limits of what we really > could promise (and by ?we? in this section, we really mean Robert Kern, > let?s > be honest). Despite all of the weasel words, our policy overpromises > compatibility. The same version of ``numpy`` built on different > platforms, or > just in a different way could cause changes in the stream, with varying > degrees > of rarity. The biggest is that the ``.multivariate_normal()`` method > relies on > ``numpy.linalg`` functions. Even on the same platform, if one links > ``numpy`` > with a different LAPACK, ``.multivariate_normal()`` may well return > completely > different results. More rarely, building on a different OS or CPU can > cause > differences in the stream. We use C ``long`` integers internally for > integer > distribution (it seemed like a good idea at the time), and those can vary > in > size depending on the platform. Distribution methods can overflow their > internal C ``longs`` at different breakpoints depending on the platform and > cause all of the random variate draws that follow to be different. > > And even if all of that is controlled, our policy still does not provide > exact > guarantees across versions. We still do apply bug fixes when correctness > is at > stake. And even if we didn?t do that, any nontrivial program does more > than > just draw random numbers. They do computations on those numbers, transform > those with numerical algorithms from the rest of ``numpy``, which is not > subject to so strict a policy. Trying to maintain stream-compatibility > for our > random number distributions does not help reproducible research for these > reasons. > > The standard practice now for bit-for-bit reproducible research is to pin > all > of the versions of code of your software stack, possibly down to the OS > itself. > The landscape for accomplishing this is much easier today than it was in > 2008. > We now have ``pip``. We now have virtual machines. Those who need to > reproduce simulations exactly now can (and ought to) do so by using the > exact > same version of ``numpy``. We do not need to maintain stream-compatibility > across ``numpy`` versions to help them. > > Our stream-compatibility guarantee has hindered our ability to make > improvements to ``numpy.random``. Several first-time contributors have > submitted PRs to improve the distributions, usually by implementing a > faster, > or more accurate algorithm than the one that is currently there. > Unfortunately, most of them would have required breaking the stream to do > so. > Blocked by our policy, and our inability to work around that policy, many > of > those contributors simply walked away. 
> > > Implementation > -------------- > > We propose first freezing ``RandomState`` as it is and developing a new RNG > subsystem alongside it. This allows anyone who has been relying on our old > stream-compatibility guarantee to have plenty of time to migrate. > ``RandomState`` will be considered deprecated, but with a long deprecation > cycle, at least a few years. Deprecation warnings will start silent but > become > increasingly noisy over time. Bugs in the current state of the code will > *not* > be fixed if fixing them would impact the stream. However, if changes in > the > rest of ``numpy`` would break something in the ``RandomState`` code, we > will > fix ``RandomState`` to continue working (for example, some change in the > C API). No new features will be added to ``RandomState``. Users should > migrate to the new subsystem as they are able to. > > Work on a proposed `new PRNG subsystem > `_ is already underway. The > specifics > of the new design are out of scope for this NEP and up for much > discussion, but > we will discuss general policies that will guide the evolution of whatever > code > is adopted. > > First, we will maintain API source compatibility just as we do with the > rest of > ``numpy``. If we *must* make a breaking change, we will only do so with an > appropriate deprecation period and warnings. > > Second, breaking stream-compatibility in order to introduce new features or > improve performance will be *allowed* with *caution*. Such changes will be > considered features, and as such will be no faster than the standard > release > cadence of features (i.e. on ``X.Y`` releases, never ``X.Y.Z``). Slowness > is > not a bug. Correctness bug fixes that break stream-compatibility can > happen on > bugfix releases, per usual, but developers should consider if they can wait > until the next feature release. We encourage developers to strongly weight > user?s pain from the break in stream-compatibility against the > improvements. > One example of a worthwhile improvement would be to change algorithms for > a significant increase in performance, for example, moving from the > `Box-Muller > transform `_ > method > of Gaussian variate generation to the faster `Ziggurat algorithm > `_. An example of an > unworthy improvement would be tweaking the Ziggurat tables just a little > bit. > > Any new design for the RNG subsystem will provide a choice of different > core > uniform PRNG algorithms. We will be more strict about a select subset of > methods on these core PRNG objects. They MUST guarantee > stream-compatibility > for a minimal, specified set of methods which are chosen to make it easier > to > compose them to build other distributions. Namely, > > * ``.bytes()`` > * ``.random_uintegers()`` > * ``.random_sample()`` > > Furthermore, the new design should also provide one generator class (we > shall > call it ``StableRandom`` for discussion purposes) that provides a slightly > broader subset of distribution methods for which stream-compatibility is > *guaranteed*. The point of ``StableRandom`` is to provide something that > can > be used in unit tests so projects that currently have tests which rely on > the > precise stream can be migrated off of ``RandomState``. For the best > transition, ``StableRandom`` should use as its core uniform PRNG the > current > MT19937 algorithm. As best as possible, the API for the distribution > methods > that are provided on ``StableRandom`` should match their counterparts on > ``RandomState``. 
They should provide the same stream that the current > version > of ``RandomState`` does. Because their intended use is for unit tests, we > do > not need the performance improvements from the new algorithms that will be > introduced by the new subsystem. > > The list of ``StableRandom`` methods should be chosen to support unit > tests: > > * ``.randint()`` > * ``.uniform()`` > * ``.normal()`` > * ``.standard_normal()`` > * ``.choice()`` > * ``.shuffle()`` > * ``.permutation()`` > > > Not Versioning > -------------- > > For a long time, we considered that the way to allow algorithmic > improvements > while maintaining the stream was to apply some form of versioning. That > is, > every time we make a stream change in one of the distributions, we > increment > some version number somewhere. ``numpy.random`` would keep all past > versions > of the code, and there would be a way to get the old versions. Proposals > of > how to do this exactly varied widely, but we will not exhaustively list > them > here. We spent years going back and forth on these designs and were not > able > to find one that sufficed. Let that time lost, and more importantly, the > contributors that we lost while we dithered, serve as evidence against the > notion. > > Concretely, adding in versioning makes maintenance of ``numpy.random`` > difficult. Necessarily, we would be keeping lots of versions of the same > code > around. Adding a new algorithm safely would still be quite hard. > > But most importantly, versioning is fundamentally difficult to *use* > correctly. > We want to make it easy and straightforward to get the latest, fastest, > best > versions of the distribution algorithms; otherwise, what's the point? The > way > to make that easy is to make the latest the default. But the default will > necessarily change from release to release, so the user?s code would need > to be > altered anyway to specify the specific version that one wants to replicate. > > Adding in versioning to maintain stream-compatibility would still only > provide > the same level of stream-compatibility that we currently do, with all of > the > limitations described earlier. Given that the standard practice for such > needs > is to pin the release of ``numpy`` as a whole, versioning ``RandomState`` > alone > is superfluous. > > > Discussion > ---------- > > - https://mail.python.org/pipermail/numpy-discussion/ > 2018-January/077608.html > - https://github.com/numpy/numpy/pull/10124#issuecomment-350876221 > > > Copyright > --------- > > This document has been placed in the public domain. > > > -- > Robert Kern > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at python.org > https://mail.python.org/mailman/listinfo/numpy-discussion > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From m.h.vankerkwijk at gmail.com Sun Jun 3 19:23:58 2018 From: m.h.vankerkwijk at gmail.com (Marten van Kerkwijk) Date: Sun, 3 Jun 2018 19:23:58 -0400 Subject: [Numpy-discussion] =?utf-8?q?NEP=3A_Dispatch_Mechanism_for_NumPy?= =?utf-8?q?=E2=80=99s_high_level_API?= In-Reply-To: References: Message-ID: > > In most cases, I suspect that the overhead of a function call and checking > several arguments for "__array_function__" will be negligible, like the > situation for __array_ufunc__. 
I'm not strongly opposed to either of your > proposed solutions, but I do think it would be a little strange to insist > that we need a solution for __array_function__ when __array_ufunc__ was > fine. > Ufuncs actually do try to speed-up array checks - but indeed the same can (and should) be done for `__array_ufunc__`. They also do have `subok`. This currently ignored but that is mostly because looking for it in `kwargs` is so damn slow! Anyway, my main point was that it should be explicitly mentioned as a constraint that for pure ndarray input, things should be really fast. > > A. Two "namespaces", one for the undecorated base functions, and one >> completely trivial one for the decorated ones. The idea would be that if >> one knows one is dealing with arrays only, one would do `import >> numpy.array_only as np` (i.e., the reverse of the suggestion currently in >> the NEP, where the decorated ones are in their own namespace - I agree with >> the reasons for discounting that one). >> > > I will mention this as a possibility. > > I do think there is something to be said for clear separation of > overloaded and non-overloaded APIs. But f I were to choose between adding > numpy.api and numpy.array_only, I would pick numpy.api, because of the > virtue of preserving the existing numpy namespace as it currently exists. > Good point. Overall, the separate namespaces probably is not the way to do. > > B. Automatic insertion by the decorator of an `array_only=np._NoValue` (or >> `coerce` and perhaps `subok=...` if not present) in the function signature, >> so that users who know that they have arrays only could pass >> `array_only=True` (name to be decided). >> > > Rather than adding another argument to every NumPy function, I would > rather encourage writing np.asarray() explicitly. > Good point - just as good as long as the check for all-array is very fast (which it should be - `arg.__class__ is np.ndarray` is fast!). > Note that both A and B could also address, at least partially, the problem >> of sometimes wanting to just use the old coercion methods, i.e., not having >> to implement every possible numpy function in one go in a new >> `__array_function__` on one's class. >> > > Yes, agreed. > > >> 1. I'm rather unclear about the use of `types`. It can help me decide >> what to do, but I would still have to find the argument in question (e.g., >> for Quantity, the unit of the relevant argument). I'd recommend passing >> instead a tuple of all arguments that were inspected, in the inspection >> order; after all, it is just a `arg.__class__` away from the type, and in >> your example you'd only have to replace `issubclass` by `isinstance`. >> > > The virtue of a `types` argument is that we can deduplicate arguments > once, rather than in each __array_function__ check. This could result in > significantly more efficient code, e.g,. when np.concatenate() is called on > 10,000 arrays with only two unique types, we don't need to loop through all > 10,000 again objects to check that overloading is valid. > I think one might still want to know *where* the type occurs (e.g., as an output or index would have different implications). Possibly, a solution would rely on the same structure as used for the "dance". But as a general point, I don't see the advantage of passing types rather than arguments - less information for no benefit. > Even for Quantity, I suspect you will want two layers of checks: > 1. A check to verify that every argument is a Quantity (or something > coercible to a Quantity). 
This could use `types` and return > `NotImplemented` when it fails. > 2. A check to verify that units match. This will have custom logic for > different operations and will require checking all arguments -- not just > their unique types. > Not sure. With, Quantity I generally do not worry about other types, but rather look at units attributes, assume anything without is dimensionless, cast Quantity to array with the right unit, and then defer to `ndarray`. > For many Quantity functions, the second check will indeed probably be > super simple (i.e., verifying that all units match). But the first check > (with `types`) really is something that basically very overload should do. > > >> 2. For subclasses, it would be very handy to have >> `ndarray.__array_function__`, so one can call super after changing >> arguments. (For `__array_ufunc__`, there was lots of question about whether >> this was useful, but it really is!!). [I think you already agreed with >> this, but want to have it in-place, as for subclasses of ndarray this is >> just as useful as it would be for subclasses of dask arrays.) >> > > Yes, indeed. > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at python.org > https://mail.python.org/mailman/listinfo/numpy-discussion > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From shoyer at gmail.com Sun Jun 3 19:28:37 2018 From: shoyer at gmail.com (Stephan Hoyer) Date: Sun, 3 Jun 2018 16:28:37 -0700 Subject: [Numpy-discussion] =?utf-8?q?NEP=3A_Dispatch_Mechanism_for_NumPy?= =?utf-8?q?=E2=80=99s_high_level_API?= In-Reply-To: References: Message-ID: On Sun, Jun 3, 2018 at 11:12 AM Marten van Kerkwijk < m.h.vankerkwijk at gmail.com> wrote: > On Sun, Jun 3, 2018 at 2:00 PM, Hameer Abbasi > wrote: > >> >> - Objects that don?t implement ``__array_function__`` should be >> treated as having returned ``np.NotImplementedButCoercible``. >> - This has the effect of coercing ``list``, etc. >> - At a minimum, to maintain compatibility, if all objects don?t >> implement ``__array_function__``, the old behaviour should stay. >> >> I think that in the proposed scheme this is effectively what happens. > The current proposal is to copy the behavior of __array_ufunc__. So the non-existence of an __array_function__ attribute is indeed *not* equivalent to returning NotImplemented: if no arguments implement __array_function__, then yes they will all be coerced to NumPy arrays. I do think there is elegance in defining a return value of np.NotImplementedButCoercible as equivalent to the existence of __array_function__. This resolves my design question about how coercible arguments would be coerced with NotImplementedButCoercible: we would fall back to the current behavior, which in most cases means all arguments are coerced to NumPy arrays directly. Mixed return values of NotImplementedButCoercible and NotImplemented would still result in TypeError, and there would be no second chances for overloads. This is simple enough that I am inclined to update the NEP to incorporate the suggestion (thank you!). My main question is whether we should also update __array_ufunc__ to support returning NotImplementedButCoercible for consistency. My inclination is yes: even though it's easy to implement a fallback of converting all arguments to NumPy arrays for ufuncs, it is hard to do this correctly from an __array_ufunc__ implementation, because __array_ufunc__ implementations do not know in what order they have been called. 
The counter-argument would be that it's not worth adding new features to __array_ufunc__ if use-cases haven't come up yet. But my guess is that most users/implementors of __array_ufunc__ are ignorant of these finer details, and not really worrying about them. Also, the list of binary operators in Python is short enough that most implementations are OK with supporting either all or none. Actually, a return value of NotImplementedButCoercible would probably be the right answer for some cases in xarray's current __array_ufunc__ method, when we encounter ufunc methods for which we haven't written an implementation (e.g., "outer" or "at"). -------------- next part -------------- An HTML attachment was scrubbed... URL: From wieser.eric+numpy at gmail.com Sun Jun 3 19:33:32 2018 From: wieser.eric+numpy at gmail.com (Eric Wieser) Date: Sun, 3 Jun 2018 16:33:32 -0700 Subject: [Numpy-discussion] NEP: Random Number Generator Policy In-Reply-To: References: Message-ID: You make a bunch of good points refuting reproducible research as an argument for not changing the random number streams. However, there?s a second use-case you don?t address - unit tests. For better or worse, downstream, or even our own , unit tests use a seeded random number generator as a shorthand to produce some arbirary array, and then hard-code the expected output in their tests. Breaking stream compatibility will break these tests. I don?t think writing tests in this way is particularly good idea, but unfortunately they do still exist. It would be good to address this use case in the NEP, even if the conclusion is just ?changing the stream will break tests of this form? Eric On Sat, 2 Jun 2018 at 12:05 Robert Kern robert.kern at gmail.com wrote: As promised distressingly many months ago, I have written up a NEP about > relaxing the stream-compatibility policy that we currently have. > > https://github.com/numpy/numpy/pull/11229 > > https://github.com/rkern/numpy/blob/nep/rng/doc/neps/nep-0019-rng-policy.rst > > I particularly invite comment on the two lists of methods that we still > would make strict compatibility guarantees for. > > --- > > ============================== > Random Number Generator Policy > ============================== > > :Author: Robert Kern > :Status: Draft > :Type: Standards Track > :Created: 2018-05-24 > > > Abstract > -------- > > For the past decade, NumPy has had a strict backwards compatibility policy > for > the number stream of all of its random number distributions. Unlike other > numerical components in ``numpy``, which are usually allowed to return > different when results when they are modified if they remain correct, we > have > obligated the random number distributions to always produce the exact same > numbers in every version. The objective of our stream-compatibility > guarantee > was to provide exact reproducibility for simulations across numpy versions > in > order to promote reproducible research. However, this policy has made it > very > difficult to enhance any of the distributions with faster or more accurate > algorithms. After a decade of experience and improvements in the > surrounding > ecosystem of scientific software, we believe that there are now better > ways to > achieve these objectives. We propose relaxing our strict > stream-compatibility > policy to remove the obstacles that are in the way of accepting > contributions > to our random number generation capabilities. 
> [... the rest of the quoted NEP is snipped here; it repeats Robert Kern's
> original message, quoted in full earlier in this thread ...]
-------------- next part --------------
An HTML attachment was scrubbed...
URL: From robert.kern at gmail.com Sun Jun 3 19:36:00 2018 From: robert.kern at gmail.com (Robert Kern) Date: Sun, 3 Jun 2018 16:36:00 -0700 Subject: [Numpy-discussion] NEP: Random Number Generator Policy In-Reply-To: References: Message-ID: On Sun, Jun 3, 2018 at 4:35 PM Eric Wieser wrote: > You make a bunch of good points refuting reproducible research as an > argument for not changing the random number streams. > > However, there?s a second use-case you don?t address - unit tests. For > better or worse, downstream, or even our own > , > unit tests use a seeded random number generator as a shorthand to produce > some arbirary array, and then hard-code the expected output in their tests. > Breaking stream compatibility will break these tests. > > I don?t think writing tests in this way is particularly good idea, but > unfortunately they do still exist. > > It would be good to address this use case in the NEP, even if the > conclusion is just ?changing the stream will break tests of this form? > I do! Search for "unit test" or "StableRandom". :-) -- Robert Kern -------------- next part -------------- An HTML attachment was scrubbed... URL: From shoyer at gmail.com Sun Jun 3 19:45:54 2018 From: shoyer at gmail.com (Stephan Hoyer) Date: Sun, 3 Jun 2018 16:45:54 -0700 Subject: [Numpy-discussion] =?utf-8?q?NEP=3A_Dispatch_Mechanism_for_NumPy?= =?utf-8?q?=E2=80=99s_high_level_API?= In-Reply-To: References: Message-ID: On Sun, Jun 3, 2018 at 4:25 PM Marten van Kerkwijk < m.h.vankerkwijk at gmail.com> wrote: > I think one might still want to know *where* the type occurs (e.g., as an > output or index would have different implications). > This in certainly true in general, but given the complete flexibility of __array_function__ there's no way we can make every check convenient. The best we can do is make it easy to handle the common cases, where the argument position does not matter. > Possibly, a solution would rely on the same structure as used for the > "dance". But as a general point, I don't see the advantage of passing types > rather than arguments - less information for no benefit. > Maybe this is premature optimization, but there will certainly be fewer unique types than arguments to check for types. I suspect this may make for a noticeable difference in performance in use cases involving a large number of argument. For example, suppose np.concatenate() is called on a list of 10,000 dask arrays. Now dask.array.Array.__array_function__ needs to check all arguments to decide whether it can use dask.array.concatenate() or needs to return NotImplemented. By using the `types` argument, it only needs to do isinstance() checks on the single argument in `types`, rather than all 10,000 overloaded function arguments. -------------- next part -------------- An HTML attachment was scrubbed... URL: From shoyer at gmail.com Sun Jun 3 20:18:38 2018 From: shoyer at gmail.com (Stephan Hoyer) Date: Sun, 3 Jun 2018 17:18:38 -0700 Subject: [Numpy-discussion] NEP: Random Number Generator Policy In-Reply-To: References: Message-ID: On Sat, Jun 2, 2018 at 12:06 PM Robert Kern wrote: > We propose first freezing ``RandomState`` as it is and developing a new RNG > subsystem alongside it. This allows anyone who has been relying on our old > stream-compatibility guarantee to have plenty of time to migrate. > ``RandomState`` will be considered deprecated, but with a long deprecation > cycle, at least a few years. Deprecation warnings will start silent but > become > increasingly noisy over time. 
Bugs in the current state of the code will > *not* > be fixed if fixing them would impact the stream. However, if changes in > the > rest of ``numpy`` would break something in the ``RandomState`` code, we > will > fix ``RandomState`` to continue working (for example, some change in the > C API). No new features will be added to ``RandomState``. Users should > migrate to the new subsystem as they are able to. > Robert, thanks for this proposal. I think it makes a lot of sense and will help maintain the long-term viability of numpy.random. The main clarification I would like to see addressed is what "freezing RandomState" means for top level functions in numpy.random. I think we could safely swap out the underlying implementation if numpy.random.seed() is not explicitly called, but how would we handle cases where a seed is explicitly set? You and I both agree that this is an anti-pattern for numpy.random, but certainly there is plenty of code that relies on the stability of random numbers when seeds are set by np.random.seed(). Similar to the case for RandomState, we would presumably need to start issuing warnings when seed() is explicitly called, which begs the question of what (if anything) we propose to replace seed() with. I suppose this will be your next NEP :). -------------- next part -------------- An HTML attachment was scrubbed... URL: From robert.kern at gmail.com Sun Jun 3 20:21:56 2018 From: robert.kern at gmail.com (Robert Kern) Date: Sun, 3 Jun 2018 17:21:56 -0700 Subject: [Numpy-discussion] NEP: Random Number Generator Policy In-Reply-To: References: Message-ID: Moving some of the Github PR comments here: Implementation > -------------- > > We propose first freezing ``RandomState`` as it is and developing a new RNG > subsystem alongside it. This allows anyone who has been relying on our old > stream-compatibility guarantee to have plenty of time to migrate. > ``RandomState`` will be considered deprecated, but with a long deprecation > cycle, at least a few years. > https://github.com/numpy/numpy/pull/11229#discussion_r192604195 @bashtage writes: > RandomState could pretty easily be spun out into a stand-alone package, if useful. It is effectively a stand-alone submodule already. Indeed. That would be a graceful forever-home for the code for anyone who needs it. However, I'd still only make that switch after at least a few years of deprecation inside numpy. And maybe a 2.0.0 release. > Any new design for the RNG subsystem will provide a choice of different > core > uniform PRNG algorithms. We will be more strict about a select subset of > methods on these core PRNG objects. They MUST guarantee > stream-compatibility > for a minimal, specified set of methods which are chosen to make it easier > to > compose them to build other distributions. Namely, > > * ``.bytes()`` > * ``.random_uintegers()`` > * ``.random_sample()`` > BTW, `random_uintegers()` is a new method in Kevin Sheppard's `randomgen`, and I am referring to its semantics here. https://github.com/bashtage/randomgen/blob/master/randomgen/generator.pyx#L191 https://github.com/numpy/numpy/pull/11229#discussion_r192604275 @bashtage writes: > One of these (bytes, uintegers) seems redundant. uintegers should probably by 64 bit. Because different core generators have different "native" outputs (MT19937, PCG32 output `uint32`s, PCG64 outputs `uint64`s, and some that I hope we never implement natively output doubles), there are some simple, but non-trivial choices to make to support each of these. 
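For example, a rough sketch of one such choice for a 32-bit-native core generator (the function names are illustrative, not a proposed API):

    import numpy as np

    def random_uint64(next_uint32, size):
        # Pack two consecutive 32-bit draws into each 64-bit output, high
        # word first -- one of several equally valid conventions that has to
        # be fixed once and for all for the stream guarantee to mean anything.
        out = np.empty(size, dtype=np.uint64)
        for i in range(size):
            hi = np.uint64(next_uint32())
            lo = np.uint64(next_uint32())
            out[i] = (hi << np.uint64(32)) | lo
        return out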
I would like the core generator's author to make those choices and maintain them. They're not hard, but they are the kind of thing that ought to be decided once and consistently. I am of the opinion that `uintegers` should support at least `uint32` and `uint64` as those are the most common native outputs among core generators. There should be a maintained way to get that native format (and yes, I'd rather have the user be explicit about it than have `random_native_uint()` in addition to `random_uint64()`). This argument extends to `.bytes()`, too, now that I think about it. A stream of bytes is a native format for some generators, too, like if we decide to hook up /dev/urandom or other file-backed interface. Hmm, what do you think about adding `random_interval()` to this list? And raising that up to the Python API level (a la what Python 3 did with exposing `secrets.randbelow()` as a primitive)? https://github.com/bashtage/randomgen/blob/master/randomgen/src/distributions/distributions.c#L1164-L1200 Many, many uses of this method would be with numbers much less than 1<<32 (e.g. Fisher-Yates shuffle), and for the 32-bit native PRNGs could mean using half as many core PRNG draws if `random_interval()` is implemented along with the core PRNG to make use of that fact. The list of ``StableRandom`` methods should be chosen to support unit tests: > > * ``.randint()`` > * ``.uniform()`` > * ``.normal()`` > * ``.standard_normal()`` > * ``.choice()`` > * ``.shuffle()`` > * ``.permutation()`` > https://github.com/numpy/numpy/pull/11229#discussion_r192604311 @bashtage writes: > standard_gamma and standard_exponential are important enough to be included here IMO. "Importance" was not my criterion, only whether they are used in unit test suites. This list was just off the top of my head for methods that I think were actually used in test suites, so I'd be happy to be shown live tests that use other methods. I'd like to be a *little* conservative about what methods we stick in here, but we don't have to be *too* conservative, since we are explicitly never going to be modifying these. -- Robert Kern -------------- next part -------------- An HTML attachment was scrubbed... URL: From robert.kern at gmail.com Sun Jun 3 20:36:58 2018 From: robert.kern at gmail.com (Robert Kern) Date: Sun, 3 Jun 2018 17:36:58 -0700 Subject: [Numpy-discussion] NEP: Random Number Generator Policy In-Reply-To: References: Message-ID: On Sun, Jun 3, 2018 at 4:35 PM Eric Wieser wrote: > You make a bunch of good points refuting reproducible research as an > argument for not changing the random number streams. > > However, there?s a second use-case you don?t address - unit tests. For > better or worse, downstream, or even our own > , > unit tests use a seeded random number generator as a shorthand to produce > some arbirary array, and then hard-code the expected output in their tests. > Breaking stream compatibility will break these tests. > By the way, the reason that I didn't mention this use case as a motivation in the Status Quo section because, as I reviewed my mail archive, this wasn't actually a motivating use case for the policy. It's certainly a use case that developed once we did make these (*cough*extravagant*cough*) guarantees, though, as people started to rely on it, and I hope that my StableRandom proposal addresses it to your satisfaction. I could add some more details about that history if you think it would be useful. -- Robert Kern -------------- next part -------------- An HTML attachment was scrubbed... 
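For concreteness, the downstream test pattern that ``StableRandom`` is meant to keep working is roughly the following sketch (the expected value is a placeholder here rather than a real hard-coded number):

    import numpy as np

    def test_fit_regression():
        rs = np.random.RandomState(12345)   # seeded stream used as a fixture
        x = rs.standard_normal(1000)        # "arbitrary" input data
        stat = x.mean()                     # stand-in for a fitted parameter
        expected = stat                     # real tests hard-code a number from a previous run
        np.testing.assert_allclose(stat, expected, rtol=1e-12)

Any change to the underlying stream changes ``stat`` and breaks the hard-coded comparison, which is the breakage that migrating such tests onto ``StableRandom`` is meant to avoid.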
URL: From robert.kern at gmail.com Sun Jun 3 20:37:45 2018 From: robert.kern at gmail.com (Robert Kern) Date: Sun, 3 Jun 2018 17:37:45 -0700 Subject: [Numpy-discussion] NEP: Random Number Generator Policy In-Reply-To: References: Message-ID: On Sun, Jun 3, 2018 at 5:23 PM Stephan Hoyer wrote: > On Sat, Jun 2, 2018 at 12:06 PM Robert Kern wrote: > >> We propose first freezing ``RandomState`` as it is and developing a new >> RNG >> subsystem alongside it. This allows anyone who has been relying on our >> old >> stream-compatibility guarantee to have plenty of time to migrate. >> ``RandomState`` will be considered deprecated, but with a long deprecation >> cycle, at least a few years. Deprecation warnings will start silent but >> become >> increasingly noisy over time. Bugs in the current state of the code will >> *not* >> be fixed if fixing them would impact the stream. However, if changes in >> the >> rest of ``numpy`` would break something in the ``RandomState`` code, we >> will >> fix ``RandomState`` to continue working (for example, some change in the >> C API). No new features will be added to ``RandomState``. Users should >> migrate to the new subsystem as they are able to. >> > > Robert, thanks for this proposal. I think it makes a lot of sense and will > help maintain the long-term viability of numpy.random. > > The main clarification I would like to see addressed is what "freezing > RandomState" means for top level functions in numpy.random. I think we > could safely swap out the underlying implementation if numpy.random.seed() > is not explicitly called, but how would we handle cases where a seed is > explicitly set? > > You and I both agree that this is an anti-pattern for numpy.random, but > certainly there is plenty of code that relies on the stability of random > numbers when seeds are set by np.random.seed(). Similar to the case for > RandomState, we would presumably need to start issuing warnings when seed() > is explicitly called, which begs the question of what (if anything) we > propose to replace seed() with. > Well, *I* propose `AttributeError`, myself? > I suppose this will be your next NEP :). > I deliberately left it out of this one as it may, depending on our choices, impinge upon the design of the new PRNG subsystem, which I declared out of scope for this NEP. I have ideas (besides the glib "Let them eat AttributeErrors!"), and now that I think more about it, that does seem like it might be in scope just like the discussion of freezing RandomState and StableRandom are. But I think I'd like to hold that thought a little bit and get a little more screaming^Wfeedback on the core proposal first. I'll return to this in a few days if not sooner. -- Robert Kern -------------- next part -------------- An HTML attachment was scrubbed... URL: From m.h.vankerkwijk at gmail.com Sun Jun 3 20:35:55 2018 From: m.h.vankerkwijk at gmail.com (Marten van Kerkwijk) Date: Sun, 3 Jun 2018 20:35:55 -0400 Subject: [Numpy-discussion] =?utf-8?q?NEP=3A_Dispatch_Mechanism_for_NumPy?= =?utf-8?q?=E2=80=99s_high_level_API?= In-Reply-To: References: Message-ID: Although I'm still not 100% convinced by NotImplementedButCoercible, I do like the idea that this is the default for items that do not implement `__array_function__`. And it might help avoid trying to find oneself in a possibly long list. -- Marten -------------- next part -------------- An HTML attachment was scrubbed... 
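Coming back to the ``random_interval()`` idea from earlier in the thread, the saving for 32-bit core generators comes from masked rejection against the smallest covering power of two; a hypothetical helper (not the randomgen implementation) might look like:

    def random_interval(next_uint32, next_uint64, max_value):
        # When max_value fits in 32 bits, a 32-bit core generator can satisfy
        # each attempt with a single native draw instead of two.
        draw = next_uint32 if max_value < (1 << 32) else next_uint64
        mask = (1 << int(max_value).bit_length()) - 1   # masked rejection keeps it unbiased
        while True:
            candidate = draw() & mask
            if candidate <= max_value:
                return candidate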
URL: From josef.pktd at gmail.com Sun Jun 3 20:45:31 2018 From: josef.pktd at gmail.com (josef.pktd at gmail.com) Date: Sun, 3 Jun 2018 20:45:31 -0400 Subject: [Numpy-discussion] NEP: Random Number Generator Policy In-Reply-To: References: Message-ID: On Sun, Jun 3, 2018 at 8:21 PM, Robert Kern wrote: > Moving some of the Github PR comments here: > > Implementation >> -------------- >> >> We propose first freezing ``RandomState`` as it is and developing a new >> RNG >> subsystem alongside it. This allows anyone who has been relying on our >> old >> stream-compatibility guarantee to have plenty of time to migrate. >> ``RandomState`` will be considered deprecated, but with a long deprecation >> cycle, at least a few years. >> > > https://github.com/numpy/numpy/pull/11229#discussion_r192604195 > @bashtage writes: > > RandomState could pretty easily be spun out into a stand-alone package, > if useful. It is effectively a stand-alone submodule already. > > Indeed. That would be a graceful forever-home for the code for anyone who > needs it. However, I'd still only make that switch after at least a few > years of deprecation inside numpy. And maybe a 2.0.0 release. > > >> Any new design for the RNG subsystem will provide a choice of different >> core >> uniform PRNG algorithms. We will be more strict about a select subset of >> methods on these core PRNG objects. They MUST guarantee >> stream-compatibility >> for a minimal, specified set of methods which are chosen to make it >> easier to >> compose them to build other distributions. Namely, >> >> * ``.bytes()`` >> * ``.random_uintegers()`` >> > * ``.random_sample()`` >> > > BTW, `random_uintegers()` is a new method in Kevin Sheppard's `randomgen`, > and I am referring to its semantics here. > https://github.com/bashtage/randomgen/blob/master/ > randomgen/generator.pyx#L191 > > https://github.com/numpy/numpy/pull/11229#discussion_r192604275 > @bashtage writes: > > One of these (bytes, uintegers) seems redundant. uintegers should > probably by 64 bit. > > Because different core generators have different "native" outputs > (MT19937, PCG32 output `uint32`s, PCG64 outputs `uint64`s, and some that I > hope we never implement natively output doubles), there are some simple, > but non-trivial choices to make to support each of these. I would like the > core generator's author to make those choices and maintain them. They're > not hard, but they are the kind of thing that ought to be decided once and > consistently. > > I am of the opinion that `uintegers` should support at least `uint32` and > `uint64` as those are the most common native outputs among core generators. > There should be a maintained way to get that native format (and yes, I'd > rather have the user be explicit about it than have `random_native_uint()` > in addition to `random_uint64()`). > > This argument extends to `.bytes()`, too, now that I think about it. A > stream of bytes is a native format for some generators, too, like if we > decide to hook up /dev/urandom or other file-backed interface. > > Hmm, what do you think about adding `random_interval()` to this list? And > raising that up to the Python API level (a la what Python 3 did with > exposing `secrets.randbelow()` as a primitive)? > https://github.com/bashtage/randomgen/blob/master/ > randomgen/src/distributions/distributions.c#L1164-L1200 > > Many, many uses of this method would be with numbers much less than 1<<32 > (e.g. 
Fisher-Yates shuffle), and for the 32-bit native PRNGs could mean > using half as many core PRNG draws if `random_interval()` is implemented > along with the core PRNG to make use of that fact. > > The list of ``StableRandom`` methods should be chosen to support unit >> tests: >> >> * ``.randint()`` >> * ``.uniform()`` >> * ``.normal()`` >> * ``.standard_normal()`` >> * ``.choice()`` >> * ``.shuffle()`` >> * ``.permutation()`` >> > > https://github.com/numpy/numpy/pull/11229#discussion_r192604311 > @bashtage writes: > > standard_gamma and standard_exponential are important enough to be > included here IMO. > > "Importance" was not my criterion, only whether they are used in unit test > suites. This list was just off the top of my head for methods that I think > were actually used in test suites, so I'd be happy to be shown live tests > that use other methods. I'd like to be a *little* conservative about what > methods we stick in here, but we don't have to be *too* conservative, since > we are explicitly never going to be modifying these. > That's one area where I thought the selection is too narrow. We should be able to get a stable stream from the uniform for some distributions. However, according to the Wikipedia description Poisson doesn't look easy. I just wrote a unit test for statsmodels using Poisson random numbers with hard coded numbers for the regression tests. I'm not sure which other distributions are common enough and not easily reproducible by transformation. E.g. negative binomial can be reproduces by a gamma-poisson mixture. On the other hand normal can be easily recreated from standard_normal. Would it be difficult to keep this list large, given that it should be frozen, low maintenance code ? Josef > > -- > Robert Kern > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at python.org > https://mail.python.org/mailman/listinfo/numpy-discussion > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From m.h.vankerkwijk at gmail.com Sun Jun 3 20:45:56 2018 From: m.h.vankerkwijk at gmail.com (Marten van Kerkwijk) Date: Sun, 3 Jun 2018 20:45:56 -0400 Subject: [Numpy-discussion] =?utf-8?q?NEP=3A_Dispatch_Mechanism_for_NumPy?= =?utf-8?q?=E2=80=99s_high_level_API?= In-Reply-To: References: Message-ID: This in certainly true in general, but given the complete flexibility of __array_function__ there's no way we can make every check convenient. The best we can do is make it easy to handle the common cases, where the argument position does not matter. I think those cases may not be as common as you think - most functions are not like `concatenate` & co... Indeed, it might be good to add some other examples to the NEP. Looing at the list of functions which do not work with Quantity currently: Maybe `np.dot`, `np.choose`, and `np.vectorize`? > Possibly, a solution would rely on the same structure as used for the > "dance". But as a general point, I don't see the advantage of passing types > rather than arguments - less information for no benefit. > > Maybe this is premature optimization, but there will certainly be fewer unique types than arguments to check for types. I suspect this may make for a noticeable difference in performance in use cases involving a large number of argument. One also needs to worry about the cost of contructing `types`, though I guess this could be minimal if it is a `set`. 
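For example, the scan could collect them in a single pass (a sketch only; the helper name and the exact protocol are made up):

    def collect_types_and_overloads(relevant_args):
        # Using a set keeps the later type checks proportional to the number
        # of unique types rather than the number of arguments.
        types = set()
        overloaded = []
        for arg in relevant_args:
            t = type(arg)
            if t not in types:
                types.add(t)
                if hasattr(t, '__array_function__'):
                    overloaded.append(arg)
        return types, overloaded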
Or should it be the keys of a `dict`, with the value something meaningful that has to be calculated anyway (like a list of sequence numbers); this may all depend a bit on the implementation of "dance" - the information it gathers might as well get passed on. > For example, suppose np.concatenate() is called on a list of 10,000 dask arrays. Now dask.array.Array.__array_function__ needs to check all arguments to decide whether it can use dask.array.concatenate() or needs to return NotImplemented. By using the `types` argument, it only needs to do isinstance() checks on the single argument in `types`, rather than all 10,000 overloaded function arguments It is probably a good idea to add some of these considerations to the NEP. -- Marten -------------- next part -------------- An HTML attachment was scrubbed... URL: From josef.pktd at gmail.com Sun Jun 3 20:57:12 2018 From: josef.pktd at gmail.com (josef.pktd at gmail.com) Date: Sun, 3 Jun 2018 20:57:12 -0400 Subject: [Numpy-discussion] NEP: Random Number Generator Policy In-Reply-To: References: Message-ID: On Sun, Jun 3, 2018 at 8:36 PM, Robert Kern wrote: > On Sun, Jun 3, 2018 at 4:35 PM Eric Wieser > wrote: > >> You make a bunch of good points refuting reproducible research as an >> argument for not changing the random number streams. >> >> However, there?s a second use-case you don?t address - unit tests. For >> better or worse, downstream, or even our own >> , >> unit tests use a seeded random number generator as a shorthand to produce >> some arbirary array, and then hard-code the expected output in their tests. >> Breaking stream compatibility will break these tests. >> > By the way, the reason that I didn't mention this use case as a motivation > in the Status Quo section because, as I reviewed my mail archive, this > wasn't actually a motivating use case for the policy. It's certainly a use > case that developed once we did make these (*cough*extravagant*cough*) > guarantees, though, as people started to rely on it, and I hope that my > StableRandom proposal addresses it to your satisfaction. I could add some > more details about that history if you think it would be useful. > I don't think that's accurate. The unit tests for stable random numbers were added when Enthought silently changed the normal random numbers and we got messages from users that the unit tests fail and they cannot reproduce our results. 6/12/10 [SciPy-Dev] seeded randn gets different values on osx (I don't find an online copy, this is from my own mail archive) AFAIR Josef > > -- > Robert Kern > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at python.org > https://mail.python.org/mailman/listinfo/numpy-discussion > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From robert.kern at gmail.com Sun Jun 3 21:04:55 2018 From: robert.kern at gmail.com (Robert Kern) Date: Sun, 3 Jun 2018 18:04:55 -0700 Subject: [Numpy-discussion] NEP: Random Number Generator Policy In-Reply-To: References: Message-ID: On Sun, Jun 3, 2018 at 6:01 PM wrote: > > > On Sun, Jun 3, 2018 at 8:36 PM, Robert Kern wrote: > >> On Sun, Jun 3, 2018 at 4:35 PM Eric Wieser >> wrote: >> >>> You make a bunch of good points refuting reproducible research as an >>> argument for not changing the random number streams. >>> >>> However, there?s a second use-case you don?t address - unit tests. 
For >>> better or worse, downstream, or even our own >>> , >>> unit tests use a seeded random number generator as a shorthand to produce >>> some arbirary array, and then hard-code the expected output in their tests. >>> Breaking stream compatibility will break these tests. >>> >> By the way, the reason that I didn't mention this use case as a >> motivation in the Status Quo section because, as I reviewed my mail >> archive, this wasn't actually a motivating use case for the policy. It's >> certainly a use case that developed once we did make these >> (*cough*extravagant*cough*) guarantees, though, as people started to rely >> on it, and I hope that my StableRandom proposal addresses it to your >> satisfaction. I could add some more details about that history if you >> think it would be useful. >> > > I don't think that's accurate. > The unit tests for stable random numbers were added when Enthought > silently changed the normal random numbers and we got messages from users > that the unit tests fail and they cannot reproduce our results. > > 6/12/10 > [SciPy-Dev] seeded randn gets different values on osx > > (I don't find an online copy, this is from my own mail archive) > The policy was in place Nov 2008. -- Robert Kern -------------- next part -------------- An HTML attachment was scrubbed... URL: From robert.kern at gmail.com Sun Jun 3 21:08:38 2018 From: robert.kern at gmail.com (Robert Kern) Date: Sun, 3 Jun 2018 18:08:38 -0700 Subject: [Numpy-discussion] NEP: Random Number Generator Policy In-Reply-To: References: Message-ID: On Sun, Jun 3, 2018 at 5:46 PM wrote: > > > On Sun, Jun 3, 2018 at 8:21 PM, Robert Kern wrote: > >> >> The list of ``StableRandom`` methods should be chosen to support unit >>> tests: >>> >>> * ``.randint()`` >>> * ``.uniform()`` >>> * ``.normal()`` >>> * ``.standard_normal()`` >>> * ``.choice()`` >>> * ``.shuffle()`` >>> * ``.permutation()`` >>> >> >> https://github.com/numpy/numpy/pull/11229#discussion_r192604311 >> @bashtage writes: >> > standard_gamma and standard_exponential are important enough to be >> included here IMO. >> >> "Importance" was not my criterion, only whether they are used in unit >> test suites. This list was just off the top of my head for methods that I >> think were actually used in test suites, so I'd be happy to be shown live >> tests that use other methods. I'd like to be a *little* conservative about >> what methods we stick in here, but we don't have to be *too* conservative, >> since we are explicitly never going to be modifying these. >> > > That's one area where I thought the selection is too narrow. > We should be able to get a stable stream from the uniform for some > distributions. > > However, according to the Wikipedia description Poisson doesn't look easy. > I just wrote a unit test for statsmodels using Poisson random numbers with > hard coded numbers for the regression tests. > I'd really rather people do this than use StableRandom; this is best practice, as I see it, if your tests involve making precise comparisons to expected results. StableRandom is intended as a crutch so that the pain of moving existing unit tests away from the deprecated RandomState is less onerous. I'd really rather people write better unit tests! In particular, I do not want to add any of the integer-domain distributions (aside from shuffle/permutation/choice) as these are the ones that have the platform-dependency issues with respect to 32/64-bit `long` integers. They'd be unreliable for unit tests even if we kept them stable over time. 
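The platform dependence is easy to see, since numpy's default integer tracks the C ``long`` (a quick check, nothing more):

    import ctypes
    import numpy as np

    # 8 bytes on typical 64-bit Linux/macOS builds, 4 bytes on 64-bit Windows,
    # so code that accumulates in a C ``long`` internally reaches its overflow
    # breakpoints at different places on different platforms.
    print(ctypes.sizeof(ctypes.c_long))
    print(np.dtype(np.int_).itemsize)   # numpy's default integer matches C long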
> I'm not sure which other distributions are common enough and not easily > reproducible by transformation. E.g. negative binomial can be reproduces by > a gamma-poisson mixture. > > On the other hand normal can be easily recreated from standard_normal. > I was mostly motivated by making it a bit easier to mechanically replace uses of randn(), which is probably even more common than normal() and standard_normal() in unit tests. > Would it be difficult to keep this list large, given that it should be > frozen, low maintenance code ? > I admit that I had in mind non-statistical unit tests. That is, tests that didn't depend on the precise distribution of the inputs. -- Robert Kern -------------- next part -------------- An HTML attachment was scrubbed... URL: From josef.pktd at gmail.com Sun Jun 3 21:11:30 2018 From: josef.pktd at gmail.com (josef.pktd at gmail.com) Date: Sun, 3 Jun 2018 21:11:30 -0400 Subject: [Numpy-discussion] NEP: Random Number Generator Policy In-Reply-To: References: Message-ID: On Sat, Jun 2, 2018 at 3:04 PM, Robert Kern wrote: > As promised distressingly many months ago, I have written up a NEP about > relaxing the stream-compatibility policy that we currently have. > > https://github.com/numpy/numpy/pull/11229 > https://github.com/rkern/numpy/blob/nep/rng/doc/neps/ > nep-0019-rng-policy.rst > > I particularly invite comment on the two lists of methods that we still > would make strict compatibility guarantees for. > > --- > > ============================== > Random Number Generator Policy > ============================== > > :Author: Robert Kern > :Status: Draft > :Type: Standards Track > :Created: 2018-05-24 > > > Abstract > -------- > > For the past decade, NumPy has had a strict backwards compatibility policy > for > the number stream of all of its random number distributions. Unlike other > numerical components in ``numpy``, which are usually allowed to return > different when results when they are modified if they remain correct, we > have > obligated the random number distributions to always produce the exact same > numbers in every version. The objective of our stream-compatibility > guarantee > was to provide exact reproducibility for simulations across numpy versions > in > order to promote reproducible research. However, this policy has made it > very > difficult to enhance any of the distributions with faster or more accurate > algorithms. After a decade of experience and improvements in the > surrounding > ecosystem of scientific software, we believe that there are now better > ways to > achieve these objectives. We propose relaxing our strict > stream-compatibility > policy to remove the obstacles that are in the way of accepting > contributions > to our random number generation capabilities. > > > The Status Quo > -------------- > > Our current policy, in full: > > A fixed seed and a fixed series of calls to ``RandomState`` methods > using the > same parameters will always produce the same results up to roundoff > error > except when the values were incorrect. Incorrect values will be fixed > and > the NumPy version in which the fix was made will be noted in the > relevant > docstring. Extension of existing parameter ranges and the addition of > new > parameters is allowed as long the previous behavior remains unchanged. 
> > This policy was first instated in Nov 2008 (in essence; the full set of > weasel > words grew over time) in response to a user wanting to be sure that the > simulations that formed the basis of their scientific publication could be > reproduced years later, exactly, with whatever version of ``numpy`` that > was > current at the time. We were keen to support reproducible research, and > it was > still early in the life of ``numpy.random``. We had not seen much cause to > change the distribution methods all that much. > > We also had not thought very thoroughly about the limits of what we really > could promise (and by ?we? in this section, we really mean Robert Kern, > let?s > be honest). Despite all of the weasel words, our policy overpromises > compatibility. The same version of ``numpy`` built on different > platforms, or > just in a different way could cause changes in the stream, with varying > degrees > of rarity. The biggest is that the ``.multivariate_normal()`` method > relies on > ``numpy.linalg`` functions. Even on the same platform, if one links > ``numpy`` > with a different LAPACK, ``.multivariate_normal()`` may well return > completely > different results. More rarely, building on a different OS or CPU can > cause > differences in the stream. > AFAIK, I have never seen this. Except for some corner cases (like singular transformation) the "noise" from different linalg packages is in the range of floating point noise which is not relevant if we unit test, for example, pvalues at rtol=1e-10. Based on the unit test that don't fail, "may well return completely different results" seems exaggerated. (There can be huge jumps in results from linalg operations like svd around the near singular/singular threshold, i.e. when floating point noise is in the range of the rcond threshold, but that's independent of np.random and can happen in many cases when we want to have reproducible numerical noise which is not possible, but doesn't affect stability of results in well defined cases.) Josef > We use C ``long`` integers internally for integer > distribution (it seemed like a good idea at the time), and those can vary > in > size depending on the platform. Distribution methods can overflow their > internal C ``longs`` at different breakpoints depending on the platform and > cause all of the random variate draws that follow to be different. > > And even if all of that is controlled, our policy still does not provide > exact > guarantees across versions. We still do apply bug fixes when correctness > is at > stake. And even if we didn?t do that, any nontrivial program does more > than > just draw random numbers. They do computations on those numbers, transform > those with numerical algorithms from the rest of ``numpy``, which is not > subject to so strict a policy. Trying to maintain stream-compatibility > for our > random number distributions does not help reproducible research for these > reasons. > > The standard practice now for bit-for-bit reproducible research is to pin > all > of the versions of code of your software stack, possibly down to the OS > itself. > The landscape for accomplishing this is much easier today than it was in > 2008. > We now have ``pip``. We now have virtual machines. Those who need to > reproduce simulations exactly now can (and ought to) do so by using the > exact > same version of ``numpy``. We do not need to maintain stream-compatibility > across ``numpy`` versions to help them. 
> > Our stream-compatibility guarantee has hindered our ability to make > improvements to ``numpy.random``. Several first-time contributors have > submitted PRs to improve the distributions, usually by implementing a > faster, > or more accurate algorithm than the one that is currently there. > Unfortunately, most of them would have required breaking the stream to do > so. > Blocked by our policy, and our inability to work around that policy, many > of > those contributors simply walked away. > > > Implementation > -------------- > > We propose first freezing ``RandomState`` as it is and developing a new RNG > subsystem alongside it. This allows anyone who has been relying on our old > stream-compatibility guarantee to have plenty of time to migrate. > ``RandomState`` will be considered deprecated, but with a long deprecation > cycle, at least a few years. Deprecation warnings will start silent but > become > increasingly noisy over time. Bugs in the current state of the code will > *not* > be fixed if fixing them would impact the stream. However, if changes in > the > rest of ``numpy`` would break something in the ``RandomState`` code, we > will > fix ``RandomState`` to continue working (for example, some change in the > C API). No new features will be added to ``RandomState``. Users should > migrate to the new subsystem as they are able to. > > Work on a proposed `new PRNG subsystem > `_ is already underway. The > specifics > of the new design are out of scope for this NEP and up for much > discussion, but > we will discuss general policies that will guide the evolution of whatever > code > is adopted. > > First, we will maintain API source compatibility just as we do with the > rest of > ``numpy``. If we *must* make a breaking change, we will only do so with an > appropriate deprecation period and warnings. > > Second, breaking stream-compatibility in order to introduce new features or > improve performance will be *allowed* with *caution*. Such changes will be > considered features, and as such will be no faster than the standard > release > cadence of features (i.e. on ``X.Y`` releases, never ``X.Y.Z``). Slowness > is > not a bug. Correctness bug fixes that break stream-compatibility can > happen on > bugfix releases, per usual, but developers should consider if they can wait > until the next feature release. We encourage developers to strongly weight > user?s pain from the break in stream-compatibility against the > improvements. > One example of a worthwhile improvement would be to change algorithms for > a significant increase in performance, for example, moving from the > `Box-Muller > transform `_ > method > of Gaussian variate generation to the faster `Ziggurat algorithm > `_. An example of an > unworthy improvement would be tweaking the Ziggurat tables just a little > bit. > > Any new design for the RNG subsystem will provide a choice of different > core > uniform PRNG algorithms. We will be more strict about a select subset of > methods on these core PRNG objects. They MUST guarantee > stream-compatibility > for a minimal, specified set of methods which are chosen to make it easier > to > compose them to build other distributions. Namely, > > * ``.bytes()`` > * ``.random_uintegers()`` > * ``.random_sample()`` > > Furthermore, the new design should also provide one generator class (we > shall > call it ``StableRandom`` for discussion purposes) that provides a slightly > broader subset of distribution methods for which stream-compatibility is > *guaranteed*. 
The point of ``StableRandom`` is to provide something that > can > be used in unit tests so projects that currently have tests which rely on > the > precise stream can be migrated off of ``RandomState``. For the best > transition, ``StableRandom`` should use as its core uniform PRNG the > current > MT19937 algorithm. As best as possible, the API for the distribution > methods > that are provided on ``StableRandom`` should match their counterparts on > ``RandomState``. They should provide the same stream that the current > version > of ``RandomState`` does. Because their intended use is for unit tests, we > do > not need the performance improvements from the new algorithms that will be > introduced by the new subsystem. > > The list of ``StableRandom`` methods should be chosen to support unit > tests: > > * ``.randint()`` > * ``.uniform()`` > * ``.normal()`` > * ``.standard_normal()`` > * ``.choice()`` > * ``.shuffle()`` > * ``.permutation()`` > > > Not Versioning > -------------- > > For a long time, we considered that the way to allow algorithmic > improvements > while maintaining the stream was to apply some form of versioning. That > is, > every time we make a stream change in one of the distributions, we > increment > some version number somewhere. ``numpy.random`` would keep all past > versions > of the code, and there would be a way to get the old versions. Proposals > of > how to do this exactly varied widely, but we will not exhaustively list > them > here. We spent years going back and forth on these designs and were not > able > to find one that sufficed. Let that time lost, and more importantly, the > contributors that we lost while we dithered, serve as evidence against the > notion. > > Concretely, adding in versioning makes maintenance of ``numpy.random`` > difficult. Necessarily, we would be keeping lots of versions of the same > code > around. Adding a new algorithm safely would still be quite hard. > > But most importantly, versioning is fundamentally difficult to *use* > correctly. > We want to make it easy and straightforward to get the latest, fastest, > best > versions of the distribution algorithms; otherwise, what's the point? The > way > to make that easy is to make the latest the default. But the default will > necessarily change from release to release, so the user?s code would need > to be > altered anyway to specify the specific version that one wants to replicate. > > Adding in versioning to maintain stream-compatibility would still only > provide > the same level of stream-compatibility that we currently do, with all of > the > limitations described earlier. Given that the standard practice for such > needs > is to pin the release of ``numpy`` as a whole, versioning ``RandomState`` > alone > is superfluous. > > > Discussion > ---------- > > - https://mail.python.org/pipermail/numpy-discussion/ > 2018-January/077608.html > - https://github.com/numpy/numpy/pull/10124#issuecomment-350876221 > > > Copyright > --------- > > This document has been placed in the public domain. > > > -- > Robert Kern > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at python.org > https://mail.python.org/mailman/listinfo/numpy-discussion > > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From josef.pktd at gmail.com Sun Jun 3 21:25:15 2018 From: josef.pktd at gmail.com (josef.pktd at gmail.com) Date: Sun, 3 Jun 2018 21:25:15 -0400 Subject: [Numpy-discussion] NEP: Random Number Generator Policy In-Reply-To: References: Message-ID: On Sun, Jun 3, 2018 at 9:04 PM, Robert Kern wrote: > On Sun, Jun 3, 2018 at 6:01 PM wrote: > >> >> >> On Sun, Jun 3, 2018 at 8:36 PM, Robert Kern >> wrote: >> >>> On Sun, Jun 3, 2018 at 4:35 PM Eric Wieser >>> wrote: >>> >>>> You make a bunch of good points refuting reproducible research as an >>>> argument for not changing the random number streams. >>>> >>>> However, there?s a second use-case you don?t address - unit tests. For >>>> better or worse, downstream, or even our own >>>> , >>>> unit tests use a seeded random number generator as a shorthand to produce >>>> some arbirary array, and then hard-code the expected output in their tests. >>>> Breaking stream compatibility will break these tests. >>>> >>> By the way, the reason that I didn't mention this use case as a >>> motivation in the Status Quo section because, as I reviewed my mail >>> archive, this wasn't actually a motivating use case for the policy. It's >>> certainly a use case that developed once we did make these >>> (*cough*extravagant*cough*) guarantees, though, as people started to rely >>> on it, and I hope that my StableRandom proposal addresses it to your >>> satisfaction. I could add some more details about that history if you >>> think it would be useful. >>> >> >> I don't think that's accurate. >> The unit tests for stable random numbers were added when Enthought >> silently changed the normal random numbers and we got messages from users >> that the unit tests fail and they cannot reproduce our results. >> >> 6/12/10 >> [SciPy-Dev] seeded randn gets different values on osx >> >> (I don't find an online copy, this is from my own mail archive) >> > > The policy was in place Nov 2008. > only for the underlying stream, but those unit tests didn't guarantee it for the actual distributions https://github.com/numpy/numpy/commit/898e6bdc625cdd3c97865ef99f8d51c5f43eafff So maybe there was a discussion in 2008 which was mostly before my time. The guarantee for distributions was added in 2010/2011, at least in terms of unit tests in numpy in order to protect the unit tests in scipy.stats and by analogy for similar cases in other packages and across users. Josef > > -- > Robert Kern > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at python.org > https://mail.python.org/mailman/listinfo/numpy-discussion > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From robert.kern at gmail.com Sun Jun 3 21:52:19 2018 From: robert.kern at gmail.com (Robert Kern) Date: Sun, 3 Jun 2018 18:52:19 -0700 Subject: [Numpy-discussion] NEP: Random Number Generator Policy In-Reply-To: References: Message-ID: On Sun, Jun 3, 2018 at 6:26 PM wrote: > > > On Sun, Jun 3, 2018 at 9:04 PM, Robert Kern wrote: > >> On Sun, Jun 3, 2018 at 6:01 PM wrote: >> >>> >>> >>> On Sun, Jun 3, 2018 at 8:36 PM, Robert Kern >>> wrote: >>> >>>> On Sun, Jun 3, 2018 at 4:35 PM Eric Wieser >>>> wrote: >>>> >>>>> You make a bunch of good points refuting reproducible research as an >>>>> argument for not changing the random number streams. >>>>> >>>>> However, there?s a second use-case you don?t address - unit tests. 
For >>>>> better or worse, downstream, or even our own >>>>> , >>>>> unit tests use a seeded random number generator as a shorthand to produce >>>>> some arbirary array, and then hard-code the expected output in their tests. >>>>> Breaking stream compatibility will break these tests. >>>>> >>>> By the way, the reason that I didn't mention this use case as a >>>> motivation in the Status Quo section because, as I reviewed my mail >>>> archive, this wasn't actually a motivating use case for the policy. It's >>>> certainly a use case that developed once we did make these >>>> (*cough*extravagant*cough*) guarantees, though, as people started to rely >>>> on it, and I hope that my StableRandom proposal addresses it to your >>>> satisfaction. I could add some more details about that history if you >>>> think it would be useful. >>>> >>> >>> I don't think that's accurate. >>> The unit tests for stable random numbers were added when Enthought >>> silently changed the normal random numbers and we got messages from users >>> that the unit tests fail and they cannot reproduce our results. >>> >>> 6/12/10 >>> [SciPy-Dev] seeded randn gets different values on osx >>> >>> (I don't find an online copy, this is from my own mail archive) >>> >> >> The policy was in place Nov 2008. >> > > only for the underlying stream, but those unit tests didn't guarantee it > for the actual distributions > > https://github.com/numpy/numpy/commit/898e6bdc625cdd3c97865ef99f8d51c5f43eafff > > So maybe there was a discussion in 2008 which was mostly before my time. > The guarantee for distributions was added in 2010/2011, at least in terms > of unit tests in numpy > in order to protect the unit tests in scipy.stats and by analogy for > similar cases in other packages > and across users. > The policy existed for the distributions regardless of whether or not we had a test suite that ensured it. I cannot share internal emails, of course, but please be assured that the existence of the policy was one of my arguments for rolling back that addition to EPD (and would have been what I argued to prevent it from going out, had I been aware of it). -- Robert Kern -------------- next part -------------- An HTML attachment was scrubbed... URL: From josef.pktd at gmail.com Sun Jun 3 21:54:03 2018 From: josef.pktd at gmail.com (josef.pktd at gmail.com) Date: Sun, 3 Jun 2018 21:54:03 -0400 Subject: [Numpy-discussion] NEP: Random Number Generator Policy In-Reply-To: References: Message-ID: On Sun, Jun 3, 2018 at 9:08 PM, Robert Kern wrote: > On Sun, Jun 3, 2018 at 5:46 PM wrote: > >> >> >> On Sun, Jun 3, 2018 at 8:21 PM, Robert Kern >> wrote: >> >>> >>> The list of ``StableRandom`` methods should be chosen to support unit >>>> tests: >>>> >>>> * ``.randint()`` >>>> * ``.uniform()`` >>>> * ``.normal()`` >>>> * ``.standard_normal()`` >>>> * ``.choice()`` >>>> * ``.shuffle()`` >>>> * ``.permutation()`` >>>> >>> >>> https://github.com/numpy/numpy/pull/11229#discussion_r192604311 >>> @bashtage writes: >>> > standard_gamma and standard_exponential are important enough to be >>> included here IMO. >>> >>> "Importance" was not my criterion, only whether they are used in unit >>> test suites. This list was just off the top of my head for methods that I >>> think were actually used in test suites, so I'd be happy to be shown live >>> tests that use other methods. I'd like to be a *little* conservative about >>> what methods we stick in here, but we don't have to be *too* conservative, >>> since we are explicitly never going to be modifying these. 
>>> >> >> That's one area where I thought the selection is too narrow. >> We should be able to get a stable stream from the uniform for some >> distributions. >> >> However, according to the Wikipedia description Poisson doesn't look >> easy. I just wrote a unit test for statsmodels using Poisson random numbers >> with hard coded numbers for the regression tests. >> > > I'd really rather people do this than use StableRandom; this is best > practice, as I see it, if your tests involve making precise comparisons to > expected results. > I hardcoded the results not the random data. So the unit tests rely on a reproducible stream of Poisson random numbers. I don't want to save 500 (100 or 1000) observations in a csv file for every variation of the unit test that I run. > > StableRandom is intended as a crutch so that the pain of moving existing > unit tests away from the deprecated RandomState is less onerous. I'd really > rather people write better unit tests! > > In particular, I do not want to add any of the integer-domain > distributions (aside from shuffle/permutation/choice) as these are the ones > that have the platform-dependency issues with respect to 32/64-bit `long` > integers. They'd be unreliable for unit tests even if we kept them stable > over time. > > >> I'm not sure which other distributions are common enough and not easily >> reproducible by transformation. E.g. negative binomial can be reproduces by >> a gamma-poisson mixture. >> >> On the other hand normal can be easily recreated from standard_normal. >> > > I was mostly motivated by making it a bit easier to mechanically replace > uses of randn(), which is probably even more common than normal() and > standard_normal() in unit tests. > > >> Would it be difficult to keep this list large, given that it should be >> frozen, low maintenance code ? >> > > I admit that I had in mind non-statistical unit tests. That is, tests that > didn't depend on the precise distribution of the inputs. > The problem is that the unit test in `stats` rely on precise inputs (up to some numerical noise). For example p-values themselves are uniformly distributed if the hypothesis test works correctly. That mean if I don't have control over the inputs, then my p-value could be anything in (0, 1). So either we need a real dataset, save all the random numbers in a file or have a reproducible set of random numbers. 95% of the unit tests that I write are for statistics. A large fraction of them don't rely on the exact distribution, but do rely on a random numbers that are "good enough". For example, when writing unit test, then I get every once in a while or sometimes more often a "bad" stream of random numbers, for which convergence might fail or where the estimated numbers are far away from the true numbers, so test tolerance would have to be very high. If I pick one of the seeds that looks good, then I can have tighter unit test tolerance to insure results are good in a nice case. The problem is that we cannot write robust unit tests for regression tests without stable inputs. E.g. I verified my results with a Monte Carlo with 5000 replications and 1000 Poisson observations in each. Results look close to expected and won't depend much on the exact stream of random variables. But the Monte Carlo for each variant of the test took about 40 seconds. Doing this for all option combination and dataset specification takes too long to be feasible in a unit test suite. 
So I rely on numpy's stable random numbers and hard code the results for a specific random sample in the regression unit tests. Josef > > -- > Robert Kern > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at python.org > https://mail.python.org/mailman/listinfo/numpy-discussion > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From shoyer at gmail.com Sun Jun 3 22:08:34 2018 From: shoyer at gmail.com (Stephan Hoyer) Date: Sun, 3 Jun 2018 19:08:34 -0700 Subject: [Numpy-discussion] NEP: Random Number Generator Policy In-Reply-To: References: Message-ID: On Sun, Jun 3, 2018 at 5:39 PM Robert Kern wrote: > You and I both agree that this is an anti-pattern for numpy.random, but >> certainly there is plenty of code that relies on the stability of random >> numbers when seeds are set by np.random.seed(). Similar to the case for >> RandomState, we would presumably need to start issuing warnings when seed() >> is explicitly called, which begs the question of what (if anything) we >> propose to replace seed() with. >> > > Well, *I* propose `AttributeError`, myself? > > >> I suppose this will be your next NEP :). >> > > I deliberately left it out of this one as it may, depending on our > choices, impinge upon the design of the new PRNG subsystem, which I > declared out of scope for this NEP. I have ideas (besides the glib "Let > them eat AttributeErrors!"), and now that I think more about it, that does > seem like it might be in scope just like the discussion of freezing > RandomState and StableRandom are. But I think I'd like to hold that thought > a little bit and get a little more screaming^Wfeedback on the core proposal > first. I'll return to this in a few days if not sooner. > For this NEP, it might be enough here to say that the current behavior of np.random.seed() will be deprecated just like np.random.RandomState(), since the current implementation of np.random.seed() is intimately tied to RandomState. The natural of the exact replacement (if any) can be left for future discussion. -------------- next part -------------- An HTML attachment was scrubbed... URL: From shoyer at gmail.com Sun Jun 3 22:31:13 2018 From: shoyer at gmail.com (Stephan Hoyer) Date: Sun, 3 Jun 2018 19:31:13 -0700 Subject: [Numpy-discussion] =?utf-8?q?NEP=3A_Dispatch_Mechanism_for_NumPy?= =?utf-8?q?=E2=80=99s_high_level_API?= In-Reply-To: References: Message-ID: On Sun, Jun 3, 2018 at 5:44 PM Marten van Kerkwijk < m.h.vankerkwijk at gmail.com> wrote: > Although I'm still not 100% convinced by NotImplementedButCoercible, I do > like the idea that this is the default for items that do not implement > `__array_function__`. And it might help avoid trying to find oneself in a > possibly long list. > Another potential consideration in favor of NotImplementedButCoercible is for subclassing: we could use it to write the default implementations of ndarray.__array_ufunc__ and ndarray.__array_function__, e.g., class ndarray: def __array_ufunc__(self, *args, **kwargs): return NotIImplementedButCoercible def __array_function__(self, *args, **kwargs): return NotIImplementedButCoercible I think (not 100% sure yet) this would result in exactly equivalent behavior to what ndarray.__array_ufunc__ currently does: http://www.numpy.org/neps/nep-0013-ufunc-overrides.html#subclass-hierarchies -------------- next part -------------- An HTML attachment was scrubbed... 
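Spelling the suggestion out a little (everything below is hypothetical, since neither the sentinel nor the dispatch helper exists yet), the defaults and one possible way a dispatcher could react to them:

    import numpy as np

    NotImplementedButCoercible = object()   # hypothetical sentinel

    class MyNdarray:
        # The suggested default, written out: "I have nothing special to say
        # about this call, but feel free to coerce me and try again."
        def __array_ufunc__(self, *args, **kwargs):
            return NotImplementedButCoercible

        def __array_function__(self, func, types, args, kwargs):
            return NotImplementedButCoercible

    def try_array_function_override(func, overloaded_args, types, args, kwargs):
        # One possible reading of how the dispatcher could treat the sentinel.
        coercion_ok = True
        for arg in overloaded_args:
            result = type(arg).__array_function__(arg, func, types, args, kwargs)
            if result is NotImplementedButCoercible:
                continue                    # no opinion; coercion stays allowed
            if result is NotImplemented:
                coercion_ok = False         # a hard refusal rules out coercion
            else:
                return result               # this argument handled the call
        if coercion_ok:
            return func(*[np.asarray(a) for a in args], **kwargs)
        raise TypeError("no implementation found for %r" % func)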
URL: From ralf.gommers at gmail.com Sun Jun 3 23:20:23 2018 From: ralf.gommers at gmail.com (Ralf Gommers) Date: Sun, 3 Jun 2018 20:20:23 -0700 Subject: [Numpy-discussion] NEP: Random Number Generator Policy In-Reply-To: References: Message-ID: On Sun, Jun 3, 2018 at 6:54 PM, wrote: > > > On Sun, Jun 3, 2018 at 9:08 PM, Robert Kern wrote: > >> On Sun, Jun 3, 2018 at 5:46 PM wrote: >> >>> >>> >>> On Sun, Jun 3, 2018 at 8:21 PM, Robert Kern >>> wrote: >>> >>>> >>>> The list of ``StableRandom`` methods should be chosen to support unit >>>>> tests: >>>>> >>>>> * ``.randint()`` >>>>> * ``.uniform()`` >>>>> * ``.normal()`` >>>>> * ``.standard_normal()`` >>>>> * ``.choice()`` >>>>> * ``.shuffle()`` >>>>> * ``.permutation()`` >>>>> >>>> >>>> https://github.com/numpy/numpy/pull/11229#discussion_r192604311 >>>> @bashtage writes: >>>> > standard_gamma and standard_exponential are important enough to be >>>> included here IMO. >>>> >>>> "Importance" was not my criterion, only whether they are used in unit >>>> test suites. This list was just off the top of my head for methods that I >>>> think were actually used in test suites, so I'd be happy to be shown live >>>> tests that use other methods. I'd like to be a *little* conservative about >>>> what methods we stick in here, but we don't have to be *too* conservative, >>>> since we are explicitly never going to be modifying these. >>>> >>> >>> That's one area where I thought the selection is too narrow. >>> We should be able to get a stable stream from the uniform for some >>> distributions. >>> >>> However, according to the Wikipedia description Poisson doesn't look >>> easy. I just wrote a unit test for statsmodels using Poisson random numbers >>> with hard coded numbers for the regression tests. >>> >> >> I'd really rather people do this than use StableRandom; this is best >> practice, as I see it, if your tests involve making precise comparisons to >> expected results. >> > > I hardcoded the results not the random data. So the unit tests rely on a > reproducible stream of Poisson random numbers. > I don't want to save 500 (100 or 1000) observations in a csv file for > every variation of the unit test that I run. > I agree, hardcoding numbers in every place where seeded random numbers are now used is quite unrealistic. It may be worth having a look at test suites for scipy, statsmodels, scikit-learn, etc. and estimate how much work this NEP causes those projects. If the devs of those packages are forced to do large scale migrations from RandomState to StableState, then why not instead keep RandomState and just add a new API next to it? Ralf > > >> >> StableRandom is intended as a crutch so that the pain of moving existing >> unit tests away from the deprecated RandomState is less onerous. I'd really >> rather people write better unit tests! >> >> In particular, I do not want to add any of the integer-domain >> distributions (aside from shuffle/permutation/choice) as these are the ones >> that have the platform-dependency issues with respect to 32/64-bit `long` >> integers. They'd be unreliable for unit tests even if we kept them stable >> over time. >> >> >>> I'm not sure which other distributions are common enough and not easily >>> reproducible by transformation. E.g. negative binomial can be reproduces by >>> a gamma-poisson mixture. >>> >>> On the other hand normal can be easily recreated from standard_normal. 
>>> >> >> I was mostly motivated by making it a bit easier to mechanically replace >> uses of randn(), which is probably even more common than normal() and >> standard_normal() in unit tests. >> >> >>> Would it be difficult to keep this list large, given that it should be >>> frozen, low maintenance code ? >>> >> >> I admit that I had in mind non-statistical unit tests. That is, tests >> that didn't depend on the precise distribution of the inputs. >> > > The problem is that the unit test in `stats` rely on precise inputs (up to > some numerical noise). > For example p-values themselves are uniformly distributed if the > hypothesis test works correctly. That mean if I don't have control over the > inputs, then my p-value could be anything in (0, 1). So either we need a > real dataset, save all the random numbers in a file or have a reproducible > set of random numbers. > > 95% of the unit tests that I write are for statistics. A large fraction of > them don't rely on the exact distribution, but do rely on a random numbers > that are "good enough". > For example, when writing unit test, then I get every once in a while or > sometimes more often a "bad" stream of random numbers, for which > convergence might fail or where the estimated numbers are far away from the > true numbers, so test tolerance would have to be very high. > If I pick one of the seeds that looks good, then I can have tighter unit > test tolerance to insure results are good in a nice case. > > The problem is that we cannot write robust unit tests for regression tests > without stable inputs. > E.g. I verified my results with a Monte Carlo with 5000 replications and > 1000 Poisson observations in each. > Results look close to expected and won't depend much on the exact stream > of random variables. > But the Monte Carlo for each variant of the test took about 40 seconds. > Doing this for all option combination and dataset specification takes too > long to be feasible in a unit test suite. > So I rely on numpy's stable random numbers and hard code the results for a > specific random sample in the regression unit tests. > > Josef > > > >> >> -- >> Robert Kern >> >> _______________________________________________ >> NumPy-Discussion mailing list >> NumPy-Discussion at python.org >> https://mail.python.org/mailman/listinfo/numpy-discussion >> >> > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at python.org > https://mail.python.org/mailman/listinfo/numpy-discussion > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From charlesr.harris at gmail.com Mon Jun 4 00:22:17 2018 From: charlesr.harris at gmail.com (Charles R Harris) Date: Sun, 3 Jun 2018 22:22:17 -0600 Subject: [Numpy-discussion] NEP: Random Number Generator Policy In-Reply-To: References: Message-ID: On Sat, Jun 2, 2018 at 1:04 PM, Robert Kern wrote: > As promised distressingly many months ago, I have written up a NEP about > relaxing the stream-compatibility policy that we currently have. > > https://github.com/numpy/numpy/pull/11229 > https://github.com/rkern/numpy/blob/nep/rng/doc/neps/ > nep-0019-rng-policy.rst > > I particularly invite comment on the two lists of methods that we still > would make strict compatibility guarantees for. 
> > --- > > ============================== > Random Number Generator Policy > ============================== > > :Author: Robert Kern > :Status: Draft > :Type: Standards Track > :Created: 2018-05-24 > > > Abstract > -------- > > For the past decade, NumPy has had a strict backwards compatibility policy > for > the number stream of all of its random number distributions. Unlike other > numerical components in ``numpy``, which are usually allowed to return > different when results when they are modified if they remain correct, we > have > obligated the random number distributions to always produce the exact same > numbers in every version. The objective of our stream-compatibility > guarantee > was to provide exact reproducibility for simulations across numpy versions > in > order to promote reproducible research. However, this policy has made it > very > difficult to enhance any of the distributions with faster or more accurate > algorithms. After a decade of experience and improvements in the > surrounding > ecosystem of scientific software, we believe that there are now better > ways to > achieve these objectives. We propose relaxing our strict > stream-compatibility > policy to remove the obstacles that are in the way of accepting > contributions > to our random number generation capabilities. > > > The Status Quo > -------------- > > Our current policy, in full: > > A fixed seed and a fixed series of calls to ``RandomState`` methods > using the > same parameters will always produce the same results up to roundoff > error > except when the values were incorrect. Incorrect values will be fixed > and > the NumPy version in which the fix was made will be noted in the > relevant > docstring. Extension of existing parameter ranges and the addition of > new > parameters is allowed as long the previous behavior remains unchanged. > > This policy was first instated in Nov 2008 (in essence; the full set of > weasel > Instituted? > words grew over time) in response to a user wanting to be sure that the > simulations that formed the basis of their scientific publication could be > reproduced years later, exactly, with whatever version of ``numpy`` that > was > current at the time. We were keen to support reproducible research, and > it was > still early in the life of ``numpy.random``. We had not seen much cause to > change the distribution methods all that much. > > We also had not thought very thoroughly about the limits of what we really > could promise (and by ?we? in this section, we really mean Robert Kern, > let?s > be honest). Despite all of the weasel words, our policy overpromises > compatibility. The same version of ``numpy`` built on different > platforms, or > just in a different way could cause changes in the stream, with varying > degrees > of rarity. The biggest is that the ``.multivariate_normal()`` method > relies on > ``numpy.linalg`` functions. Even on the same platform, if one links > ``numpy`` > with a different LAPACK, ``.multivariate_normal()`` may well return > completely > different results. More rarely, building on a different OS or CPU can > cause > differences in the stream. We use C ``long`` integers internally for > integer > distribution (it seemed like a good idea at the time), and those can vary > in > size depending on the platform. Distribution methods can overflow their > internal C ``longs`` at different breakpoints depending on the platform and > cause all of the random variate draws that follow to be different. 
> > And even if all of that is controlled, our policy still does not provide > exact > guarantees across versions. We still do apply bug fixes when correctness > is at > stake. And even if we didn?t do that, any nontrivial program does more > than > just draw random numbers. They do computations on those numbers, transform > those with numerical algorithms from the rest of ``numpy``, which is not > subject to so strict a policy. Trying to maintain stream-compatibility > for our > random number distributions does not help reproducible research for these > reasons. > > The standard practice now for bit-for-bit reproducible research is to pin > all > of the versions of code of your software stack, possibly down to the OS > itself. > The landscape for accomplishing this is much easier today than it was in > 2008. > We now have ``pip``. We now have virtual machines. Those who need to > reproduce simulations exactly now can (and ought to) do so by using the > exact > same version of ``numpy``. We do not need to maintain stream-compatibility > across ``numpy`` versions to help them. > > Our stream-compatibility guarantee has hindered our ability to make > improvements to ``numpy.random``. Several first-time contributors have > submitted PRs to improve the distributions, usually by implementing a > faster, > or more accurate algorithm than the one that is currently there. > Unfortunately, most of them would have required breaking the stream to do > so. > Blocked by our policy, and our inability to work around that policy, many > of > those contributors simply walked away. > > > Implementation > -------------- > > We propose first freezing ``RandomState`` as it is and developing a new RNG > subsystem alongside it. This allows anyone who has been relying on our old > stream-compatibility guarantee to have plenty of time to migrate. > ``RandomState`` will be considered deprecated, but with a long deprecation > cycle, at least a few years. Deprecation warnings will start silent but > become > increasingly noisy over time. Bugs in the current state of the code will > *not* > be fixed if fixing them would impact the stream. However, if changes in > the > rest of ``numpy`` would break something in the ``RandomState`` code, we > will > fix ``RandomState`` to continue working (for example, some change in the > C API). No new features will be added to ``RandomState``. Users should > migrate to the new subsystem as they are able to. > > Work on a proposed `new PRNG subsystem > `_ is already underway. The > specifics > of the new design are out of scope for this NEP and up for much > discussion, but > we will discuss general policies that will guide the evolution of whatever > code > is adopted. > > First, we will maintain API source compatibility just as we do with the > rest of > ``numpy``. If we *must* make a breaking change, we will only do so with an > appropriate deprecation period and warnings. > > Second, breaking stream-compatibility in order to introduce new features or > improve performance will be *allowed* with *caution*. Such changes will be > considered features, and as such will be no faster than the standard > release > cadence of features (i.e. on ``X.Y`` releases, never ``X.Y.Z``). Slowness > is > not a bug. Correctness bug fixes that break stream-compatibility can > happen on > bugfix releases, per usual, but developers should consider if they can wait > until the next feature release. 
We encourage developers to strongly weight > user?s pain from the break in stream-compatibility against the > improvements. > One example of a worthwhile improvement would be to change algorithms for > a significant increase in performance, for example, moving from the > `Box-Muller > transform `_ > method > of Gaussian variate generation to the faster `Ziggurat algorithm > `_. An example of an > unworthy improvement would be tweaking the Ziggurat tables just a little > bit. > > Any new design for the RNG subsystem will provide a choice of different > core > uniform PRNG algorithms. We will be more strict about a select subset of > methods on these core PRNG objects. They MUST guarantee > stream-compatibility > for a minimal, specified set of methods which are chosen to make it easier > to > compose them to build other distributions. Namely, > > * ``.bytes()`` > * ``.random_uintegers()`` > * ``.random_sample()`` > > Furthermore, the new design should also provide one generator class (we > shall > call it ``StableRandom`` for discussion purposes) that provides a slightly > broader subset of distribution methods for which stream-compatibility is > *guaranteed*. The point of ``StableRandom`` is to provide something that > can > be used in unit tests so projects that currently have tests which rely on > the > precise stream can be migrated off of ``RandomState``. For the best > transition, ``StableRandom`` should use as its core uniform PRNG the > current > MT19937 algorithm. As best as possible, the API for the distribution > methods > that are provided on ``StableRandom`` should match their counterparts on > ``RandomState``. They should provide the same stream that the current > version > of ``RandomState`` does. Because their intended use is for unit tests, we > do > not need the performance improvements from the new algorithms that will be > introduced by the new subsystem. > > The list of ``StableRandom`` methods should be chosen to support unit > tests: > > * ``.randint()`` > * ``.uniform()`` > * ``.normal()`` > * ``.standard_normal()`` > * ``.choice()`` > * ``.shuffle()`` > * ``.permutation()`` > > > Not Versioning > -------------- > > For a long time, we considered that the way to allow algorithmic > improvements > while maintaining the stream was to apply some form of versioning. That > is, > every time we make a stream change in one of the distributions, we > increment > some version number somewhere. ``numpy.random`` would keep all past > versions > of the code, and there would be a way to get the old versions. Proposals > of > how to do this exactly varied widely, but we will not exhaustively list > them > here. We spent years going back and forth on these designs and were not > able > to find one that sufficed. Let that time lost, and more importantly, the > contributors that we lost while we dithered, serve as evidence against the > notion. > > Concretely, adding in versioning makes maintenance of ``numpy.random`` > difficult. Necessarily, we would be keeping lots of versions of the same > code > around. Adding a new algorithm safely would still be quite hard. > > But most importantly, versioning is fundamentally difficult to *use* > correctly. > We want to make it easy and straightforward to get the latest, fastest, > best > versions of the distribution algorithms; otherwise, what's the point? The > way > to make that easy is to make the latest the default. 
But the default will > necessarily change from release to release, so the user?s code would need > to be > altered anyway to specify the specific version that one wants to replicate. > > Adding in versioning to maintain stream-compatibility would still only > provide > the same level of stream-compatibility that we currently do, with all of > the > limitations described earlier. Given that the standard practice for such > needs > is to pin the release of ``numpy`` as a whole, versioning ``RandomState`` > alone > is superfluous. > This section is a bit unclear. Would it be correct to say that the rng version is the numpy version? If so, it might be best to say that up front before justifying it. > > > Discussion > ---------- > > - https://mail.python.org/pipermail/numpy-discussion/ > 2018-January/077608.html > - https://github.com/numpy/numpy/pull/10124#issuecomment-350876221 > > > Copyright > --------- > > This document has been placed in the public domain. > > > Mostly off topic, but I note that the new module proposes integers of various lengths using the Python half open ranges. I would like to suggest that we modify that just a hair so we can specify the whole range in the integer interval specification. For instance, the full range of an 8 bit unsigned integer could be given as `(0, 0)`, i.e., (0, 255 + 1). This would be most useful for the biggest (64 bit) types, but I am more thinking of the case where sequences of ranges can be used. Chuck -------------- next part -------------- An HTML attachment was scrubbed... URL: From warren.weckesser at gmail.com Mon Jun 4 00:23:23 2018 From: warren.weckesser at gmail.com (Warren Weckesser) Date: Mon, 4 Jun 2018 00:23:23 -0400 Subject: [Numpy-discussion] NEP: Random Number Generator Policy In-Reply-To: References: Message-ID: On Sun, Jun 3, 2018 at 11:20 PM, Ralf Gommers wrote: > > > On Sun, Jun 3, 2018 at 6:54 PM, wrote: > >> >> >> On Sun, Jun 3, 2018 at 9:08 PM, Robert Kern >> wrote: >> >>> On Sun, Jun 3, 2018 at 5:46 PM wrote: >>> >>>> >>>> >>>> On Sun, Jun 3, 2018 at 8:21 PM, Robert Kern >>>> wrote: >>>> >>>>> >>>>> The list of ``StableRandom`` methods should be chosen to support unit >>>>>> tests: >>>>>> >>>>>> * ``.randint()`` >>>>>> * ``.uniform()`` >>>>>> * ``.normal()`` >>>>>> * ``.standard_normal()`` >>>>>> * ``.choice()`` >>>>>> * ``.shuffle()`` >>>>>> * ``.permutation()`` >>>>>> >>>>> >>>>> https://github.com/numpy/numpy/pull/11229#discussion_r192604311 >>>>> @bashtage writes: >>>>> > standard_gamma and standard_exponential are important enough to be >>>>> included here IMO. >>>>> >>>>> "Importance" was not my criterion, only whether they are used in unit >>>>> test suites. This list was just off the top of my head for methods that I >>>>> think were actually used in test suites, so I'd be happy to be shown live >>>>> tests that use other methods. I'd like to be a *little* conservative about >>>>> what methods we stick in here, but we don't have to be *too* conservative, >>>>> since we are explicitly never going to be modifying these. >>>>> >>>> >>>> That's one area where I thought the selection is too narrow. >>>> We should be able to get a stable stream from the uniform for some >>>> distributions. >>>> >>>> However, according to the Wikipedia description Poisson doesn't look >>>> easy. I just wrote a unit test for statsmodels using Poisson random numbers >>>> with hard coded numbers for the regression tests. 
>>>> >>> >>> I'd really rather people do this than use StableRandom; this is best >>> practice, as I see it, if your tests involve making precise comparisons to >>> expected results. >>> >> >> I hardcoded the results not the random data. So the unit tests rely on a >> reproducible stream of Poisson random numbers. >> I don't want to save 500 (100 or 1000) observations in a csv file for >> every variation of the unit test that I run. >> > > I agree, hardcoding numbers in every place where seeded random numbers are > now used is quite unrealistic. > > It may be worth having a look at test suites for scipy, statsmodels, > scikit-learn, etc. and estimate how much work this NEP causes those > projects. If the devs of those packages are forced to do large scale > migrations from RandomState to StableState, then why not instead keep > RandomState and just add a new API next to it? > > As a quick and imperfect test, I monkey-patched numpy so that a call to numpy.random.seed(m) actually uses m+1000 as the seed. I ran the tests using the `runtests.py` script: *seed+1000, using 'python runtests.py -n' in the source directory:* 236 failed, 12881 passed, 1248 skipped, 585 deselected, 84 xfailed, 7 xpassed Most of the failures are in scipy.stats: *seed+1000, using 'python runtests.py -n -s stats' in the source directory:* 203 failed, 1034 passed, 4 skipped, 370 deselected, 4 xfailed, 1 xpassed Changing the amount added to the seed or running the tests using the function `scipy.test("full")` gives different (but similar magnitude) results: *seed+1000, using 'import scipy; scipy.test("full")' in an ipython shell:* 269 failed, 13359 passed, 1271 skipped, 134 xfailed, 8 xpassed *seed+1, using 'python runtests.py -n' in the source directory:* 305 failed, 12812 passed, 1248 skipped, 585 deselected, 84 xfailed, 7 xpassed I suspect many of the tests will be easy to update, so fixing 300 or so tests does not seem like a monumental task. I haven't looked into why there are 585 deselected tests; maybe there are many more tests lurking there that will have to be updated. Warren Ralf > > > >> >> >>> >>> StableRandom is intended as a crutch so that the pain of moving existing >>> unit tests away from the deprecated RandomState is less onerous. I'd really >>> rather people write better unit tests! >>> >>> In particular, I do not want to add any of the integer-domain >>> distributions (aside from shuffle/permutation/choice) as these are the ones >>> that have the platform-dependency issues with respect to 32/64-bit `long` >>> integers. They'd be unreliable for unit tests even if we kept them stable >>> over time. >>> >>> >>>> I'm not sure which other distributions are common enough and not easily >>>> reproducible by transformation. E.g. negative binomial can be reproduces by >>>> a gamma-poisson mixture. >>>> >>>> On the other hand normal can be easily recreated from standard_normal. >>>> >>> >>> I was mostly motivated by making it a bit easier to mechanically replace >>> uses of randn(), which is probably even more common than normal() and >>> standard_normal() in unit tests. >>> >>> >>>> Would it be difficult to keep this list large, given that it should be >>>> frozen, low maintenance code ? >>>> >>> >>> I admit that I had in mind non-statistical unit tests. That is, tests >>> that didn't depend on the precise distribution of the inputs. >>> >> >> The problem is that the unit test in `stats` rely on precise inputs (up >> to some numerical noise). 
>> For example p-values themselves are uniformly distributed if the >> hypothesis test works correctly. That mean if I don't have control over the >> inputs, then my p-value could be anything in (0, 1). So either we need a >> real dataset, save all the random numbers in a file or have a reproducible >> set of random numbers. >> >> 95% of the unit tests that I write are for statistics. A large fraction >> of them don't rely on the exact distribution, but do rely on a random >> numbers that are "good enough". >> For example, when writing unit test, then I get every once in a while or >> sometimes more often a "bad" stream of random numbers, for which >> convergence might fail or where the estimated numbers are far away from the >> true numbers, so test tolerance would have to be very high. >> If I pick one of the seeds that looks good, then I can have tighter unit >> test tolerance to insure results are good in a nice case. >> >> The problem is that we cannot write robust unit tests for regression >> tests without stable inputs. >> E.g. I verified my results with a Monte Carlo with 5000 replications and >> 1000 Poisson observations in each. >> Results look close to expected and won't depend much on the exact stream >> of random variables. >> But the Monte Carlo for each variant of the test took about 40 seconds. >> Doing this for all option combination and dataset specification takes too >> long to be feasible in a unit test suite. >> So I rely on numpy's stable random numbers and hard code the results for >> a specific random sample in the regression unit tests. >> >> Josef >> >> >> >>> >>> -- >>> Robert Kern >>> >>> _______________________________________________ >>> NumPy-Discussion mailing list >>> NumPy-Discussion at python.org >>> https://mail.python.org/mailman/listinfo/numpy-discussion >>> >>> >> >> _______________________________________________ >> NumPy-Discussion mailing list >> NumPy-Discussion at python.org >> https://mail.python.org/mailman/listinfo/numpy-discussion >> >> > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at python.org > https://mail.python.org/mailman/listinfo/numpy-discussion > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From einstein.edison at gmail.com Mon Jun 4 00:47:15 2018 From: einstein.edison at gmail.com (Hameer Abbasi) Date: Sun, 3 Jun 2018 21:47:15 -0700 Subject: [Numpy-discussion] =?utf-8?q?NEP=3A_Dispatch_Mechanism_for_NumPy?= =?utf-8?q?=E2=80=99s_high_level_API?= In-Reply-To: References: Message-ID: Mixed return values of NotImplementedButCoercible and NotImplemented would still result in TypeError, and there would be no second chances for overloads. I would like to differ with you here: It can be quite useful to have second chances for overloads. Think ``np.func(list, custom_array))``: If second rounds did not exist, custom_array would need to have a list of coercible types (which is not nice IMO). It can also help in cases where performance/feature degradation isn?t an issue, so coercing all arguments that returned ``NotImplementedButCoercible`` would allow ``__array_function__`` to succeed where it wouldn?t normally. I mean, that?s one of the major uses of this sentinel right? If done in a for loop, it wouldn?t even slow down the nominal cases. It would have the adverse effect of not allowing for a default implementation to be as simple as you stated, though. 
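To make the for-loop idea concrete, here is a rough sketch of the two-round dispatch I have in mind. Everything in it is illustrative only: the ``NotImplementedButCoercible`` sentinel, the simplified ``__array_function__`` signature and the ``dispatch`` helper are stand-ins for what the NEP proposes, not an existing NumPy API.

    import numpy as np

    # Illustrative stand-in for the proposed sentinel; not a real NumPy object.
    NotImplementedButCoercible = object()

    def dispatch(default_impl, args):
        """Two-round dispatch sketch (signature simplified relative to the NEP)."""
        results = []
        for arg in args:
            override = getattr(type(arg), '__array_function__', None)
            if override is None:
                # Plain objects (lists, scalars, ndarrays) count as coercible.
                results.append(NotImplementedButCoercible)
            else:
                results.append(override(arg, default_impl, args, {}))
        # First round: any overload that produced a real result wins.
        for result in results:
            if result is not NotImplemented and result is not NotImplementedButCoercible:
                return result
        # Second round: if nothing refused outright, coerce everything and
        # fall back to the default implementation.
        if all(result is NotImplementedButCoercible for result in results):
            return default_impl(*[np.asarray(arg) for arg in args])
        raise TypeError("no implementation found for the given argument types")

In the nominal cases (no overloads at all, or one overload that handles the call) the second round never runs, so it costs nothing.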
One thing we could do is manually (inside ``__array_function__``) coerce anything that didn?t implement ``__array_function__``, and that?s acceptable to me too. -------------- next part -------------- An HTML attachment was scrubbed... URL: From shoyer at gmail.com Mon Jun 4 00:53:25 2018 From: shoyer at gmail.com (Stephan Hoyer) Date: Sun, 3 Jun 2018 21:53:25 -0700 Subject: [Numpy-discussion] NEP: Random Number Generator Policy In-Reply-To: References: Message-ID: On Sun, Jun 3, 2018 at 8:22 PM Ralf Gommers wrote: > It may be worth having a look at test suites for scipy, statsmodels, > scikit-learn, etc. and estimate how much work this NEP causes those > projects. If the devs of those packages are forced to do large scale > migrations from RandomState to StableState, then why not instead keep > RandomState and just add a new API next to it? > Tests that explicitly create RandomState objects would not be difficult to migrate. The goal of "StableState" is that it could be used directly in cases where RandomState is current used in tests, so I would guess that "RandomState" could be almost mechanistically replaced by "StableState". The challenging case are calls to np.random.seed(). If no replacement API is planned, then these would need to be manually converted to use StableState instead. This is probably not too onerous (and is a good cleanup to do anyways) but it would be a bit of work. -------------- next part -------------- An HTML attachment was scrubbed... URL: From robert.kern at gmail.com Mon Jun 4 01:03:28 2018 From: robert.kern at gmail.com (Robert Kern) Date: Sun, 3 Jun 2018 22:03:28 -0700 Subject: [Numpy-discussion] NEP: Random Number Generator Policy In-Reply-To: References: Message-ID: On Sun, Jun 3, 2018 at 9:24 PM Charles R Harris wrote: > > On Sat, Jun 2, 2018 at 1:04 PM, Robert Kern wrote: >> >> This policy was first instated in Nov 2008 (in essence; the full set of >> weasel >> > > Instituted? > I meant "instated"; c.f. for another usage: https://www.youredm.com/2018/06/01/spotify-new-policy-update/ But "instituted" would work just as well. It may be that "instated a policy" is just an idiosyncratic back-formation of "reinstated a policy", which even to me feels more right. Not Versioning >> -------------- >> >> For a long time, we considered that the way to allow algorithmic >> improvements >> while maintaining the stream was to apply some form of versioning. That >> is, >> every time we make a stream change in one of the distributions, we >> increment >> some version number somewhere. ``numpy.random`` would keep all past >> versions >> of the code, and there would be a way to get the old versions. Proposals >> of >> how to do this exactly varied widely, but we will not exhaustively list >> them >> here. We spent years going back and forth on these designs and were not >> able >> to find one that sufficed. Let that time lost, and more importantly, the >> contributors that we lost while we dithered, serve as evidence against the >> notion. >> >> Concretely, adding in versioning makes maintenance of ``numpy.random`` >> difficult. Necessarily, we would be keeping lots of versions of the same >> code >> around. Adding a new algorithm safely would still be quite hard. >> >> But most importantly, versioning is fundamentally difficult to *use* >> correctly. >> We want to make it easy and straightforward to get the latest, fastest, >> best >> versions of the distribution algorithms; otherwise, what's the point? >> The way >> to make that easy is to make the latest the default. 
But the default will >> necessarily change from release to release, so the user?s code would need >> to be >> altered anyway to specify the specific version that one wants to >> replicate. >> >> Adding in versioning to maintain stream-compatibility would still only >> provide >> the same level of stream-compatibility that we currently do, with all of >> the >> limitations described earlier. Given that the standard practice for such >> needs >> is to pin the release of ``numpy`` as a whole, versioning ``RandomState`` >> alone >> is superfluous. >> > > This section is a bit unclear. Would it be correct to say that the rng > version is the numpy version? If so, it might be best to say that up front > before justifying it. > I'm sorry, I'm unclear on what you are asking me to make clearer. There is currently no such thing as "the rng version". The thrust of this section of the NEP is to reject the previously floated idea of introducing the concept at all. So I would certainly not say anything along the lines that "the rng version is the numpy version". I do say, here and earlier, that the way to get the same RNG code is to get the same version of numpy. Mostly off topic, but I note that the new module proposes integers of > various lengths using the Python half open ranges. I would like to suggest > that we modify that just a hair so we can specify the whole range in the > integer interval specification. For instance, the full range of an 8 bit > unsigned integer could be given as `(0, 0)`, i.e., (0, 255 + 1). This would > be most useful for the biggest (64 bit) types, but I am more thinking of > the case where sequences of ranges can be used. > That is indeed something out of scope for this NEP discussion. Feel free to open an issue on the randomgen Github. But suffice it to say that I intend to make sure that the new subsystem has at least feature parity with the current code, and that is one of the features in the current code. -- Robert Kern -------------- next part -------------- An HTML attachment was scrubbed... URL: From josef.pktd at gmail.com Mon Jun 4 01:25:53 2018 From: josef.pktd at gmail.com (josef.pktd at gmail.com) Date: Mon, 4 Jun 2018 01:25:53 -0400 Subject: [Numpy-discussion] NEP: Random Number Generator Policy In-Reply-To: References: Message-ID: On Mon, Jun 4, 2018 at 12:53 AM, Stephan Hoyer wrote: > On Sun, Jun 3, 2018 at 8:22 PM Ralf Gommers > wrote: > >> It may be worth having a look at test suites for scipy, statsmodels, >> scikit-learn, etc. and estimate how much work this NEP causes those >> projects. If the devs of those packages are forced to do large scale >> migrations from RandomState to StableState, then why not instead keep >> RandomState and just add a new API next to it? >> > > Tests that explicitly create RandomState objects would not be difficult to > migrate. The goal of "StableState" is that it could be used directly in > cases where RandomState is current used in tests, so I would guess that > "RandomState" could be almost mechanistically replaced by "StableState". > > The challenging case are calls to np.random.seed(). If no replacement API > is planned, then these would need to be manually converted to use > StableState instead. This is probably not too onerous (and is a good > cleanup to do anyways) but it would be a bit of work. > I agree with this. Statsmodels uses mostly np.random.seed. That cleanup is planned, but postponed so far as not high priority. We will have to do it eventually. 
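For most tests the edit itself is mechanical, something like the sketch below (``StableRandom`` is only the name proposed in the NEP, it does not exist yet, and its constructor signature is my assumption):

    import numpy as np

    # current pattern in many of our unit tests:
    np.random.seed(12345)
    x = np.random.poisson(lam=5, size=100)

    # after migration: an explicit generator object threaded through the test
    rs = np.random.RandomState(12345)      # works today
    # rs = np.random.StableRandom(12345)   # the NEP's proposal, once it exists
    x = rs.poisson(lam=5, size=100)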
The main work will come when StableState doesn't include specific distribution, Poisson, NegativeBinomial, Gamma, ... and distributions that we don't even use yet, like Beta. I don't want to migrate random number generation for the distributions abandoned by numpy Stable to statsmodels. Josef > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at python.org > https://mail.python.org/mailman/listinfo/numpy-discussion > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From charlesr.harris at gmail.com Mon Jun 4 01:26:08 2018 From: charlesr.harris at gmail.com (Charles R Harris) Date: Sun, 3 Jun 2018 23:26:08 -0600 Subject: [Numpy-discussion] NEP: Random Number Generator Policy In-Reply-To: References: Message-ID: On Sun, Jun 3, 2018 at 11:03 PM, Robert Kern wrote: > On Sun, Jun 3, 2018 at 9:24 PM Charles R Harris > wrote: > >> >> On Sat, Jun 2, 2018 at 1:04 PM, Robert Kern >> wrote: >>> >>> This policy was first instated in Nov 2008 (in essence; the full set of >>> weasel >>> >> >> Instituted? >> > > I meant "instated"; c.f. for another usage: https://www.youredm.com/2018/ > 06/01/spotify-new-policy-update/ > > But "instituted" would work just as well. It may be that "instated a > policy" is just an idiosyncratic back-formation of "reinstated a policy", > which even to me feels more right. > > Not Versioning >>> -------------- >>> >>> For a long time, we considered that the way to allow algorithmic >>> improvements >>> while maintaining the stream was to apply some form of versioning. That >>> is, >>> every time we make a stream change in one of the distributions, we >>> increment >>> some version number somewhere. ``numpy.random`` would keep all past >>> versions >>> of the code, and there would be a way to get the old versions. >>> Proposals of >>> how to do this exactly varied widely, but we will not exhaustively list >>> them >>> here. We spent years going back and forth on these designs and were not >>> able >>> to find one that sufficed. Let that time lost, and more importantly, the >>> contributors that we lost while we dithered, serve as evidence against >>> the >>> notion. >>> >>> Concretely, adding in versioning makes maintenance of ``numpy.random`` >>> difficult. Necessarily, we would be keeping lots of versions of the >>> same code >>> around. Adding a new algorithm safely would still be quite hard. >>> >>> But most importantly, versioning is fundamentally difficult to *use* >>> correctly. >>> We want to make it easy and straightforward to get the latest, fastest, >>> best >>> versions of the distribution algorithms; otherwise, what's the point? >>> The way >>> to make that easy is to make the latest the default. But the default >>> will >>> necessarily change from release to release, so the user?s code would >>> need to be >>> altered anyway to specify the specific version that one wants to >>> replicate. >>> >>> Adding in versioning to maintain stream-compatibility would still only >>> provide >>> the same level of stream-compatibility that we currently do, with all of >>> the >>> limitations described earlier. Given that the standard practice for >>> such needs >>> is to pin the release of ``numpy`` as a whole, versioning >>> ``RandomState`` alone >>> is superfluous. >>> >> >> This section is a bit unclear. Would it be correct to say that the rng >> version is the numpy version? If so, it might be best to say that up front >> before justifying it. 
>> > > I'm sorry, I'm unclear on what you are asking me to make clearer. There is > currently no such thing as "the rng version". The thrust of this section of > the NEP is to reject the previously floated idea of introducing the concept > at all. So I would certainly not say anything along the lines that "the rng > version is the numpy version". I do say, here and earlier, that the way to > get the same RNG code is to get the same version of numpy. > Just so, and you could make that clearer, as you do here. > > Mostly off topic, but I note that the new module proposes integers of >> various lengths using the Python half open ranges. I would like to suggest >> that we modify that just a hair so we can specify the whole range in the >> integer interval specification. For instance, the full range of an 8 bit >> unsigned integer could be given as `(0, 0)`, i.e., (0, 255 + 1). This would >> be most useful for the biggest (64 bit) types, but I am more thinking of >> the case where sequences of ranges can be used. >> > > That is indeed something out of scope for this NEP discussion. Feel free > to open an issue on the randomgen Github. But suffice it to say that I > intend to make sure that the new subsystem has at least feature parity with > the current code, and that is one of the features in the current code. > > -- > Robert Kern > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at python.org > https://mail.python.org/mailman/listinfo/numpy-discussion > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From robert.kern at gmail.com Mon Jun 4 01:47:34 2018 From: robert.kern at gmail.com (Robert Kern) Date: Sun, 3 Jun 2018 22:47:34 -0700 Subject: [Numpy-discussion] NEP: Random Number Generator Policy In-Reply-To: References: Message-ID: On Sun, Jun 3, 2018 at 10:29 PM Charles R Harris wrote: > > > On Sun, Jun 3, 2018 at 11:03 PM, Robert Kern > wrote: > >> On Sun, Jun 3, 2018 at 9:24 PM Charles R Harris < >> charlesr.harris at gmail.com> wrote: >> >>> >>> On Sat, Jun 2, 2018 at 1:04 PM, Robert Kern >>> wrote: >>>> >>>> This policy was first instated in Nov 2008 (in essence; the full set of >>>> weasel >>>> >>> >>> Instituted? >>> >> >> I meant "instated"; c.f. for another usage: >> https://www.youredm.com/2018/06/01/spotify-new-policy-update/ >> >> But "instituted" would work just as well. It may be that "instated a >> policy" is just an idiosyncratic back-formation of "reinstated a policy", >> which even to me feels more right. >> >> Not Versioning >>>> -------------- >>>> >>>> For a long time, we considered that the way to allow algorithmic >>>> improvements >>>> while maintaining the stream was to apply some form of versioning. >>>> That is, >>>> every time we make a stream change in one of the distributions, we >>>> increment >>>> some version number somewhere. ``numpy.random`` would keep all past >>>> versions >>>> of the code, and there would be a way to get the old versions. >>>> Proposals of >>>> how to do this exactly varied widely, but we will not exhaustively list >>>> them >>>> here. We spent years going back and forth on these designs and were >>>> not able >>>> to find one that sufficed. Let that time lost, and more importantly, >>>> the >>>> contributors that we lost while we dithered, serve as evidence against >>>> the >>>> notion. >>>> >>>> Concretely, adding in versioning makes maintenance of ``numpy.random`` >>>> difficult. 
Necessarily, we would be keeping lots of versions of the >>>> same code >>>> around. Adding a new algorithm safely would still be quite hard. >>>> >>>> But most importantly, versioning is fundamentally difficult to *use* >>>> correctly. >>>> We want to make it easy and straightforward to get the latest, fastest, >>>> best >>>> versions of the distribution algorithms; otherwise, what's the point? >>>> The way >>>> to make that easy is to make the latest the default. But the default >>>> will >>>> necessarily change from release to release, so the user?s code would >>>> need to be >>>> altered anyway to specify the specific version that one wants to >>>> replicate. >>>> >>>> Adding in versioning to maintain stream-compatibility would still only >>>> provide >>>> the same level of stream-compatibility that we currently do, with all >>>> of the >>>> limitations described earlier. Given that the standard practice for >>>> such needs >>>> is to pin the release of ``numpy`` as a whole, versioning >>>> ``RandomState`` alone >>>> is superfluous. >>>> >>> >>> This section is a bit unclear. Would it be correct to say that the rng >>> version is the numpy version? If so, it might be best to say that up front >>> before justifying it. >>> >> >> I'm sorry, I'm unclear on what you are asking me to make clearer. There >> is currently no such thing as "the rng version". The thrust of this section >> of the NEP is to reject the previously floated idea of introducing the >> concept at all. So I would certainly not say anything along the lines that >> "the rng version is the numpy version". I do say, here and earlier, that >> the way to get the same RNG code is to get the same version of numpy. >> > > Just so, and you could make that clearer, as you do here. > I don't understand. All I did was repeat what I already said twice. If you'd like to provide some text that would have clarified things for you, I'll see about inserting it, but I'm at a loss for writing that text. -- Robert Kern -------------- next part -------------- An HTML attachment was scrubbed... URL: From einstein.edison at gmail.com Mon Jun 4 01:55:17 2018 From: einstein.edison at gmail.com (Hameer Abbasi) Date: Sun, 3 Jun 2018 22:55:17 -0700 Subject: [Numpy-discussion] NEP: Random Number Generator Policy In-Reply-To: References: Message-ID: How about this: "There will be no concept of a separate RNG version. In order to get consistent or reproducible results from the RNG, it will be necessary to specify the NumPy version that was used to generate those results. Results from the RNG may change across different releases of Num Py." Sent from Astro for Mac On 4. Jun 2018 at 10:47, Robert Kern wrote: On Sun, Jun 3, 2018 at 10:29 PM Charles R Harris wrote: > > > On Sun, Jun 3, 2018 at 11:03 PM, Robert Kern > wrote: > >> On Sun, Jun 3, 2018 at 9:24 PM Charles R Harris < >> charlesr.harris at gmail.com> wrote: >> >>> >>> On Sat, Jun 2, 2018 at 1:04 PM, Robert Kern >>> wrote: >>>> >>>> This policy was first instated in Nov 2008 (in essence; the full set of >>>> weasel >>>> >>> >>> Instituted? >>> >> >> I meant "instated"; c.f. for another usage: >> https://www.youredm.com/2018/06/01/spotify-new-policy-update/ >> >> But "instituted" would work just as well. It may be that "instated a >> policy" is just an idiosyncratic back-formation of "reinstated a policy", >> which even to me feels more right. 
>> >> Not Versioning >>>> -------------- >>>> >>>> For a long time, we considered that the way to allow algorithmic >>>> improvements >>>> while maintaining the stream was to apply some form of versioning. >>>> That is, >>>> every time we make a stream change in one of the distributions, we >>>> increment >>>> some version number somewhere. ``numpy.random`` would keep all past >>>> versions >>>> of the code, and there would be a way to get the old versions. >>>> Proposals of >>>> how to do this exactly varied widely, but we will not exhaustively list >>>> them >>>> here. We spent years going back and forth on these designs and were >>>> not able >>>> to find one that sufficed. Let that time lost, and more importantly, >>>> the >>>> contributors that we lost while we dithered, serve as evidence against >>>> the >>>> notion. >>>> >>>> Concretely, adding in versioning makes maintenance of ``numpy.random`` >>>> difficult. Necessarily, we would be keeping lots of versions of the >>>> same code >>>> around. Adding a new algorithm safely would still be quite hard. >>>> >>>> But most importantly, versioning is fundamentally difficult to *use* >>>> correctly. >>>> We want to make it easy and straightforward to get the latest, fastest, >>>> best >>>> versions of the distribution algorithms; otherwise, what's the point? >>>> The way >>>> to make that easy is to make the latest the default. But the default >>>> will >>>> necessarily change from release to release, so the user?s code would >>>> need to be >>>> altered anyway to specify the specific version that one wants to >>>> replicate. >>>> >>>> Adding in versioning to maintain stream-compatibility would still only >>>> provide >>>> the same level of stream-compatibility that we currently do, with all >>>> of the >>>> limitations described earlier. Given that the standard practice for >>>> such needs >>>> is to pin the release of ``numpy`` as a whole, versioning >>>> ``RandomState`` alone >>>> is superfluous. >>>> >>> >>> This section is a bit unclear. Would it be correct to say that the rng >>> version is the numpy version? If so, it might be best to say that up front >>> before justifying it. >>> >> >> I'm sorry, I'm unclear on what you are asking me to make clearer. There >> is currently no such thing as "the rng version". The thrust of this section >> of the NEP is to reject the previously floated idea of introducing the >> concept at all. So I would certainly not say anything along the lines that >> "the rng version is the numpy version". I do say, here and earlier, that >> the way to get the same RNG code is to get the same version of numpy. >> > > Just so, and you could make that clearer, as you do here. > I don't understand. All I did was repeat what I already said twice. If you'd like to provide some text that would have clarified things for you, I'll see about inserting it, but I'm at a loss for writing that text. -- Robert Kern _______________________________________________ NumPy-Discussion mailing list NumPy-Discussion at python.org https://mail.python.org/mailman/listinfo/numpy-discussion -------------- next part -------------- An HTML attachment was scrubbed... URL: From kevin.k.sheppard at gmail.com Mon Jun 4 02:05:56 2018 From: kevin.k.sheppard at gmail.com (Kevin Sheppard) Date: Mon, 4 Jun 2018 07:05:56 +0100 Subject: [Numpy-discussion] NEP: Random Number Generator Policy In-Reply-To: References: Message-ID: <5b14d6c4.1c69fb81.21ba2.13ed@mx.google.com> The seed() discussion seems unnecessary. 
StableRandom will need to have a method to set/get state which can be used by any project that needs to get reproducible numbers from the module-level generator. While this is an implementation detail, many generators have much smaller states than MT19937 (a few uint64s). So this is easy enough to hard code where needed. -------------- next part -------------- An HTML attachment was scrubbed... URL: From robert.kern at gmail.com Mon Jun 4 02:18:21 2018 From: robert.kern at gmail.com (Robert Kern) Date: Sun, 3 Jun 2018 23:18:21 -0700 Subject: [Numpy-discussion] NEP: Random Number Generator Policy In-Reply-To: <5b14d6c4.1c69fb81.21ba2.13ed@mx.google.com> References: <5b14d6c4.1c69fb81.21ba2.13ed@mx.google.com> Message-ID: On Sun, Jun 3, 2018 at 11:07 PM Kevin Sheppard wrote: > The seed() discussion seems unnecessary. StableRandom will need to have a > method to set/get state > > which can be used by any project that needs to get reproducible numbers > from the module-level generator. > > > > While this is an implementation detail, many generators have much smaller > states than MT19937 > > (a few uint64s). So this is easy enough to hard code where needed. > The question isn't about what .seed() methods look like on the new generators. Rather, it's about the behavior when code calls numpy.random.seed() then numpy.random.uniform() (or one of the other convenience aliases). Specifically, there will be a period of time when RandomState is merely deprecated but is still expected to be there and be fully backwards-compatible to give reproducible streams. Does that expectation extend to code that uses numpy.random.seed() to get that reproducibility? What happens with code that just calls numpy.random.uniform(): does it use RandomState or the new code? These questions are probably in-scope for this NEP, but I'd like to get some kind of consensus on the rest first, as the higher level decisions will tell us more about what we want to do for numpy.random.seed(). -- Robert Kern -------------- next part -------------- An HTML attachment was scrubbed... URL: From shoyer at gmail.com Mon Jun 4 02:19:51 2018 From: shoyer at gmail.com (Stephan Hoyer) Date: Sun, 3 Jun 2018 23:19:51 -0700 Subject: [Numpy-discussion] =?utf-8?q?NEP=3A_Dispatch_Mechanism_for_NumPy?= =?utf-8?q?=E2=80=99s_high_level_API?= In-Reply-To: References: Message-ID: On Sun, Jun 3, 2018 at 9:54 PM Hameer Abbasi wrote: > Mixed return values of NotImplementedButCoercible and NotImplemented would > still result in TypeError, and there would be no second chances for > overloads. > > > I would like to differ with you here: It can be quite useful to have > second chances for overloads. Think ``np.func(list, custom_array))``: If > second rounds did not exist, custom_array would need to have a list of > coercible types (which is not nice IMO). > Even if we did this, we would still want to preserve the equivalence between: 1. Returning NotImplementedButCoercible from __array_ufunc__ or __array_function__, and 2. Not implementing __array_ufunc__ or __array_function__ at all. Changing __array_ufunc__ to do multiple rounds of checks could indeed be useful in some cases, and you're right that it would not change existing behavior (in these cases we currently raise TypeError). But I'd rather leave that for a separate discussion, because it's orthogonal to our proposal here for __array_function__. (Personally, I don't think it would be worth the additional complexity.) -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From robert.kern at gmail.com Mon Jun 4 02:22:57 2018 From: robert.kern at gmail.com (Robert Kern) Date: Sun, 3 Jun 2018 23:22:57 -0700 Subject: [Numpy-discussion] NEP: Random Number Generator Policy In-Reply-To: References: Message-ID: On Sun, Jun 3, 2018 at 10:27 PM wrote: > > > On Mon, Jun 4, 2018 at 12:53 AM, Stephan Hoyer wrote: > >> On Sun, Jun 3, 2018 at 8:22 PM Ralf Gommers >> wrote: >> >>> It may be worth having a look at test suites for scipy, statsmodels, >>> scikit-learn, etc. and estimate how much work this NEP causes those >>> projects. If the devs of those packages are forced to do large scale >>> migrations from RandomState to StableState, then why not instead keep >>> RandomState and just add a new API next to it? >>> >> >> Tests that explicitly create RandomState objects would not be difficult >> to migrate. The goal of "StableState" is that it could be used directly in >> cases where RandomState is current used in tests, so I would guess that >> "RandomState" could be almost mechanistically replaced by "StableState". >> >> The challenging case are calls to np.random.seed(). If no replacement API >> is planned, then these would need to be manually converted to use >> StableState instead. This is probably not too onerous (and is a good >> cleanup to do anyways) but it would be a bit of work. >> > > I agree with this. Statsmodels uses mostly np.random.seed. That cleanup is > planned, but postponed so far as not high priority. We will have to do it > eventually. > > The main work will come when StableState doesn't include specific > distribution, Poisson, NegativeBinomial, Gamma, ... and distributions that > we don't even use yet, like Beta. > I would posit that it is probably very rare that one uses the full breadth of distributions in unit tests. You may be the only one. :-) > I don't want to migrate random number generation for the distributions > abandoned by numpy Stable to statsmodels. > What if we followed Kevin's suggestion and forked off RandomState into its own forever-frozen package sooner rather than later? It's intended use would be for people with legacy packages that cannot upgrade (other than changing some imports) and for unit tests that require precise streams for a full breadth of distributions. We would still leave it in numpy.random for a deprecation period, but maybe we would be noisy about it sooner and remove it sooner than my NEP planned for. Would that work? I'd be happy to maintain that forked-RandomState for you. I would probably still encourage most people to continue to use StableRandom for most unit testing. -- Robert Kern -------------- next part -------------- An HTML attachment was scrubbed... URL: From antoine at python.org Mon Jun 4 04:50:52 2018 From: antoine at python.org (Antoine Pitrou) Date: Mon, 4 Jun 2018 10:50:52 +0200 Subject: [Numpy-discussion] A roadmap for NumPy - longer term planning In-Reply-To: <69cf3275-26f3-deec-d499-b56204d96c60@gmail.com> References: <69cf3275-26f3-deec-d499-b56204d96c60@gmail.com> Message-ID: <92186546-e2b6-062a-b446-54704af065bf@python.org> Hi, Do you plan to consider trying to add PEP 574 / pickle5 support? There's an implementation ready (and a PyPI backport) that you can play with. https://www.python.org/dev/peps/pep-0574/ PEP 574 implicits targets Numpy arrays as one of its primary producers, since Numpy arrays is how large scientific or numerical data often ends up represented and where zero-copy is often desired by users. 
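For concreteness, the usage pattern the PEP targets looks like this with the PyPI backport (a sketch; until Numpy implements the corresponding ``__reduce_ex__`` support the array is still serialized in-band and the ``buffers`` list simply stays empty, which is exactly the gap I am asking about):

    import numpy as np
    import pickle5 as pickle   # PyPI backport of PEP 574 for current Pythons

    a = np.arange(10**6)

    buffers = []
    # With protocol 5, producers can hand large buffers to buffer_callback so
    # they travel out-of-band (zero-copy) instead of inside the pickle stream.
    data = pickle.dumps(a, protocol=5, buffer_callback=buffers.append)

    # The consumer supplies the same buffers back when unpickling.
    b = pickle.loads(data, buffers=buffers)
    assert (a == b).all()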
PEP 574 could certainly be useful even without Numpy arrays supporting it, but less so. So I would welcome any feedback on that front (and, given that I'd like PEP 574 to be accepted in time for Python 3.8, I'd ideally like to have that feedback sometimes in the forthcoming months ;-)). Best regards Antoine. On Thu, 31 May 2018 16:50:02 -0700 Matti Picus wrote: > At the recent NumPy sprint at BIDS (thanks to those who made the trip) > we spent some time brainstorming about a roadmap for NumPy, in the > spirit of similar work that was done for Jupyter. The idea is that a > document with wide community acceptance can guide the work of the > full-time developer(s), and be a source of ideas for expanding > development efforts. > > I put the document up at > https://github.com/numpy/numpy/wiki/NumPy-Roadmap, and hope to discuss > it at a BOF session during SciPy in the middle of July in Austin. > > Eventually it could become a NEP or formalized in another way. > > Matti From kevin.k.sheppard at gmail.com Mon Jun 4 05:54:27 2018 From: kevin.k.sheppard at gmail.com (Kevin Sheppard) Date: Mon, 4 Jun 2018 09:54:27 +0000 Subject: [Numpy-discussion] NEP: Random Number Generator Policy In-Reply-To: References: Message-ID: I?m not sure if this is within the scope of the NEP or an implementation detail, but I think a new PRNG should use platform independent integer types rather than depending on the platform?s choice of 64-bit data model. This should be enough to ensure that any integer distribution that only uses integers internally should produce identical results across uarch/OS. -------------- next part -------------- An HTML attachment was scrubbed... URL: From harrigan.matthew at gmail.com Mon Jun 4 07:28:09 2018 From: harrigan.matthew at gmail.com (Matthew Harrigan) Date: Mon, 4 Jun 2018 07:28:09 -0400 Subject: [Numpy-discussion] =?utf-8?q?NEP=3A_Dispatch_Mechanism_for_NumPy?= =?utf-8?q?=E2=80=99s_high_level_API?= In-Reply-To: References: Message-ID: Should there be discussion of typing (pep-484) or abstract base classes in this nep? Are there any requirements on the result returned by __array_function__? On Mon, Jun 4, 2018, 2:20 AM Stephan Hoyer wrote: > > On Sun, Jun 3, 2018 at 9:54 PM Hameer Abbasi > wrote: > >> Mixed return values of NotImplementedButCoercible and NotImplemented >> would still result in TypeError, and there would be no second chances for >> overloads. >> >> >> I would like to differ with you here: It can be quite useful to have >> second chances for overloads. Think ``np.func(list, custom_array))``: If >> second rounds did not exist, custom_array would need to have a list of >> coercible types (which is not nice IMO). >> > > Even if we did this, we would still want to preserve the equivalence > between: > 1. Returning NotImplementedButCoercible from __array_ufunc__ or > __array_function__, and > 2. Not implementing __array_ufunc__ or __array_function__ at all. > > Changing __array_ufunc__ to do multiple rounds of checks could indeed be > useful in some cases, and you're right that it would not change existing > behavior (in these cases we currently raise TypeError). But I'd rather > leave that for a separate discussion, because it's orthogonal to our > proposal here for __array_function__. > > (Personally, I don't think it would be worth the additional complexity.) 
> _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at python.org > https://mail.python.org/mailman/listinfo/numpy-discussion > -------------- next part -------------- An HTML attachment was scrubbed... URL: From josef.pktd at gmail.com Mon Jun 4 08:29:26 2018 From: josef.pktd at gmail.com (josef.pktd at gmail.com) Date: Mon, 4 Jun 2018 08:29:26 -0400 Subject: [Numpy-discussion] NEP: Random Number Generator Policy In-Reply-To: References: Message-ID: On Mon, Jun 4, 2018 at 2:22 AM, Robert Kern wrote: > On Sun, Jun 3, 2018 at 10:27 PM wrote: > >> >> >> On Mon, Jun 4, 2018 at 12:53 AM, Stephan Hoyer wrote: >> >>> On Sun, Jun 3, 2018 at 8:22 PM Ralf Gommers >>> wrote: >>> >>>> It may be worth having a look at test suites for scipy, statsmodels, >>>> scikit-learn, etc. and estimate how much work this NEP causes those >>>> projects. If the devs of those packages are forced to do large scale >>>> migrations from RandomState to StableState, then why not instead keep >>>> RandomState and just add a new API next to it? >>>> >>> >>> Tests that explicitly create RandomState objects would not be difficult >>> to migrate. The goal of "StableState" is that it could be used directly in >>> cases where RandomState is current used in tests, so I would guess that >>> "RandomState" could be almost mechanistically replaced by "StableState". >>> >>> The challenging case are calls to np.random.seed(). If no replacement >>> API is planned, then these would need to be manually converted to use >>> StableState instead. This is probably not too onerous (and is a good >>> cleanup to do anyways) but it would be a bit of work. >>> >> >> I agree with this. Statsmodels uses mostly np.random.seed. That cleanup >> is planned, but postponed so far as not high priority. We will have to do >> it eventually. >> >> The main work will come when StableState doesn't include specific >> distribution, Poisson, NegativeBinomial, Gamma, ... and distributions that >> we don't even use yet, like Beta. >> > > I would posit that it is probably very rare that one uses the full breadth > of distributions in unit tests. You may be the only one. :-) > Given that I'm one of the maintainers for Statistics in Python, I wouldn't be surprised if I would use more than almost all others. However, statsmodels doesn't use a very large set, there are other packages that use Pareto and Extreme Value distributions or circular distributions like vonmises which are not yet in statsmodels. I have no idea about whether MCMC packages still rely on numpy.random. But the main "user" of numpy's random is scipy.stats which might be using almost all of the distributions. I don't have a current overview about how much scipy.stats unit tests rely on having stable streams for the available distributions. > > >> I don't want to migrate random number generation for the distributions >> abandoned by numpy Stable to statsmodels. >> > > What if we followed Kevin's suggestion and forked off RandomState into its > own forever-frozen package sooner rather than later? It's intended use > would be for people with legacy packages that cannot upgrade (other than > changing some imports) and for unit tests that require precise streams for > a full breadth of distributions. We would still leave it in numpy.random > for a deprecation period, but maybe we would be noisy about it sooner and > remove it sooner than my NEP planned for. > > Would that work? I'd be happy to maintain that forked-RandomState for you. 
> It would not be nice to have to add another dependency, but that would work for statsmodels. I'm not sure whether scipy.stats maintainers are fine with it. Given that scipy already uses RandomState instead of the global instance, the actual change if distributions are available would be to swap a StableState for a RandomState in the unit tests, AFAIK. Josef > > I would probably still encourage most people to continue to use > StableRandom for most unit testing. > > -- > Robert Kern > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at python.org > https://mail.python.org/mailman/listinfo/numpy-discussion > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From m.h.vankerkwijk at gmail.com Mon Jun 4 10:34:49 2018 From: m.h.vankerkwijk at gmail.com (Marten van Kerkwijk) Date: Mon, 4 Jun 2018 10:34:49 -0400 Subject: [Numpy-discussion] =?utf-8?q?NEP=3A_Dispatch_Mechanism_for_NumPy?= =?utf-8?q?=E2=80=99s_high_level_API?= In-Reply-To: References: Message-ID: Hi Stephan, Another potential consideration in favor of NotImplementedButCoercible is > for subclassing: we could use it to write the default implementations of > ndarray.__array_ufunc__ and ndarray.__array_function__, e.g., > > class ndarray: > def __array_ufunc__(self, *args, **kwargs): > return NotIImplementedButCoercible > def __array_function__(self, *args, **kwargs): > return NotIImplementedButCoercible > > I think (not 100% sure yet) this would result in exactly equivalent > behavior to what ndarray.__array_ufunc__ currently does: > http://www.numpy.org/neps/nep-0013-ufunc-overrides.html# > subclass-hierarchies > As written would not work for ndarray subclasses, because the subclass will generically change itself before calling super. At least for Quantity, say if I add two quantities, the quantities will both be converted to arrays (with one scaled so that the units match) and then the super call is done with those modified arrays. This expects that the super call will actually return a result (which it now can because all inputs are arrays). But I think it would work to return `NotImplementedButCoercible` in the case that perhaps you had in mind in the first place, in which any of the *other* arguments had a `__array_ufunc__` implementation and `ndarray` thus does not know what to do. For those cases, `ndarray` currently returns a straight `NotImplemented`. Though I am still a bit worried: this gets back to `Quantity.__array_ufunc__`, but what does it do with it? It cannot just pass it on, since then it is effectively telling, incorrectly, that the *quantity* is coercible, which it is not. I guess at this point it would have to change it to `NotImplemented`. Looking at my current implementation, I see that if we made this change to `ndarray.__array_ufunc__`, the implementation would mostly raise an exception as it tried to view `NotImplementedButCoercible` as a quantity, except for comparisons, where the output is not viewed at all (being boolean and thus unit-less) and passed straight down. That said, we've said the __array_ufunc__ implementation is experimental, so I think such small annoyances are OK. Overall, it is an intriguing idea, and I think it should be mentioned at least in the NEP. It would be good, though, to have a few more examples of how it would work in practice. All the best, Marten -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From m.h.vankerkwijk at gmail.com Mon Jun 4 10:37:19 2018 From: m.h.vankerkwijk at gmail.com (Marten van Kerkwijk) Date: Mon, 4 Jun 2018 10:37:19 -0400 Subject: [Numpy-discussion] =?utf-8?q?NEP=3A_Dispatch_Mechanism_for_NumPy?= =?utf-8?q?=E2=80=99s_high_level_API?= In-Reply-To: References: Message-ID: I agree that second rounds of overloads have to be left to the implementers of `__array_function__` - obviously, though, we should be sure that these rounds are rarely necessary... The link posted by Stephan [1] has some decent discussion for `__array_ufunc__` about when an override should re-call the function rather than try to do something itself. -- Marten [1] http://www.numpy.org/neps/nep-0013-ufunc-overrides.html#subclass-hierarchies -------------- next part -------------- An HTML attachment was scrubbed... URL: From shoyer at gmail.com Mon Jun 4 11:09:35 2018 From: shoyer at gmail.com (Stephan Hoyer) Date: Mon, 4 Jun 2018 08:09:35 -0700 Subject: [Numpy-discussion] A roadmap for NumPy - longer term planning In-Reply-To: <92186546-e2b6-062a-b446-54704af065bf@python.org> References: <69cf3275-26f3-deec-d499-b56204d96c60@gmail.com> <92186546-e2b6-062a-b446-54704af065bf@python.org> Message-ID: PEP-574 isn't on the roadmap (yet!), but I think we would clearly welcome it. Like all NumPy improvements, it would need to implemented by an interested party. On Mon, Jun 4, 2018 at 1:52 AM Antoine Pitrou wrote: > > Hi, > > Do you plan to consider trying to add PEP 574 / pickle5 support? There's > an implementation ready (and a PyPI backport) that you can play with. > https://www.python.org/dev/peps/pep-0574/ > > PEP 574 implicits targets Numpy arrays as one of its primary producers, > since Numpy arrays is how large scientific or numerical data often ends > up represented and where zero-copy is often desired by users. > > PEP 574 could certainly be useful even without Numpy arrays supporting > it, but less so. So I would welcome any feedback on that front (and, > given that I'd like PEP 574 to be accepted in time for Python 3.8, I'd > ideally like to have that feedback sometimes in the forthcoming months > ;-)). > > Best regards > > Antoine. > > > On Thu, 31 May 2018 16:50:02 -0700 > Matti Picus wrote: > > At the recent NumPy sprint at BIDS (thanks to those who made the trip) > > we spent some time brainstorming about a roadmap for NumPy, in the > > spirit of similar work that was done for Jupyter. The idea is that a > > document with wide community acceptance can guide the work of the > > full-time developer(s), and be a source of ideas for expanding > > development efforts. > > > > I put the document up at > > https://github.com/numpy/numpy/wiki/NumPy-Roadmap, and hope to discuss > > it at a BOF session during SciPy in the middle of July in Austin. > > > > Eventually it could become a NEP or formalized in another way. > > > > Matti > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at python.org > https://mail.python.org/mailman/listinfo/numpy-discussion > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From robert.kern at gmail.com Mon Jun 4 13:58:59 2018 From: robert.kern at gmail.com (Robert Kern) Date: Mon, 4 Jun 2018 10:58:59 -0700 Subject: [Numpy-discussion] NEP: Random Number Generator Policy In-Reply-To: References: Message-ID: On Mon, Jun 4, 2018 at 2:55 AM Kevin Sheppard wrote: > I?m not sure if this is within the scope of the NEP or an implementation > detail, but I think a new PRNG should use platform independent integer > types rather than depending on the platform?s choice of 64-bit data model. > This should be enough to ensure that any integer distribution that only > uses integers internally should produce identical results across uarch/OS. > Probably an implementation detail (possibly one that ought to be worked out in its own NEP). I know that I would like it if the new system had all of the same distribution methods as RandomState currently does, such that we can drop in the new generator objects in places where RandomState is currently used, and everything would still work (just with a different stream). Might want to add a statement to that effect in this NEP. I think it's likely "good enough" if the integer distributions now return uint64 arrays instead of uint32 arrays on Windows. -- Robert Kern -------------- next part -------------- An HTML attachment was scrubbed... URL: From robert.kern at gmail.com Mon Jun 4 18:18:25 2018 From: robert.kern at gmail.com (Robert Kern) Date: Mon, 4 Jun 2018 15:18:25 -0700 Subject: [Numpy-discussion] NEP: Random Number Generator Policy In-Reply-To: References: Message-ID: On Sun, Jun 3, 2018 at 8:22 PM Ralf Gommers wrote: > It may be worth having a look at test suites for scipy, statsmodels, > scikit-learn, etc. and estimate how much work this NEP causes those > projects. If the devs of those packages are forced to do large scale > migrations from RandomState to StableState, then why not instead keep > RandomState and just add a new API next to it? > The problem is that we can't really have an ecosystem with two different general purpose systems. To properly use pseudorandom numbers, I need to instantiate a PRNG and thread it through all of the code in my program: both the parts that I write and the third party libraries that I don't write. Generating test data for unit tests is separable, though. That's why I propose having a StableRandom built on the new architecture. Its purpose would be well-documented, and in my proposal is limited in features such that it will be less likely to be abused outside of that purpose. If you make it fully-featured, it is more likely to be abused by building library code around it. But even if it is so abused, because it is built on the new architecture, at least I can thread the same core PRNG state through the StableRandom distributions from the abusing library and use the better distributions class elsewhere (randomgen names it "Generator"). Just keeping RandomState around can't work like that because it doesn't have a replaceable core PRNG. But that does suggest another alternative that we should explore: The new architecture separates the core uniform PRNG from the wide variety of non-uniform probability distributions. That is, the core PRNG state is encapsulated in a discrete object that can be shared between instances of different distribution-providing classes. numpy.random should provide two such distribution-providing classes. 
The main one (let us call it ``Generator``, as it is called in the prototype) will follow the new policy: distribution methods can break the stream in feature releases. There will also be a secondary distributions class (let us call it ``LegacyGenerator``) which contains distribution methods exactly as they exist in the current ``RandomState`` implementation. When one combines ``LegacyGenerator`` with the MT19937 core PRNG, it should reproduce the exact same stream as ``RandomState`` for all distribution methods. The ``LegacyGenerator`` methods will be forever frozen. ``numpy.random.RandomState()`` will instantiate a ``LegacyGenerator`` with the MT19937 core PRNG, and whatever tricks needed to make ``isinstance(prng, RandomState)`` and unpickling work should be done. This way of creating the ``LegacyGenerator`` by way of ``RandomState`` will be deprecated, becoming progressively noisier over a number of release cycles, in favor of explicitly instantiating ``LegacyGenerator``. ``LegacyGenerator`` CAN be used during this deprecation period in library and application code until libraries and applications can migrate to the new ``Generator``. Libraries and applications SHOULD migrate but MUST NOT be forced to. ``LegacyGenerator`` CAN be used to generate test data for unit tests where cross-release stability of the streams is important. Test writers SHOULD consider ways to mitigate their reliance on such stability and SHOULD limit their usage to distribution methods that have fewer cross-platform stability risks. -- Robert Kern -------------- next part -------------- An HTML attachment was scrubbed... URL: From charlesr.harris at gmail.com Tue Jun 5 10:56:42 2018 From: charlesr.harris at gmail.com (Charles R Harris) Date: Tue, 5 Jun 2018 08:56:42 -0600 Subject: [Numpy-discussion] NumPy 1.14.4 release Message-ID: Hi All, The release notes for the NumPy 1.14.4 release are up. Chuck -------------- next part -------------- An HTML attachment was scrubbed... URL: From shoyer at gmail.com Tue Jun 5 14:34:18 2018 From: shoyer at gmail.com (Stephan Hoyer) Date: Tue, 5 Jun 2018 11:34:18 -0700 Subject: [Numpy-discussion] =?utf-8?q?NEP=3A_Dispatch_Mechanism_for_NumPy?= =?utf-8?q?=E2=80=99s_high_level_API?= In-Reply-To: References: Message-ID: On Mon, Jun 4, 2018 at 5:39 AM Matthew Harrigan wrote: > Should there be discussion of typing (pep-484) or abstract base classes in > this nep? Are there any requirements on the result returned by > __array_function__? > This is a good question that should be addressed in the NEP. Currently, we impose no limitations on the types returned by __array_function__ (or __array_ufunc__, for that matter). Given the complexity of potential __array_function__ implementations, I think this would be hard/impossible to do in general. I think the best case scenario we could hope for is that type checkers would identify that result of NumPy functions as: - numpy.ndarray if all inputs are numpy.ndarray objects - Any if any non-numpy.ndarray inputs implement the __array_function__ Based on my understanding of proposed rules for typing protocols [1] and overloads [2], I think this could just work, e.g., @overload def func(array: np.ndarray) -> np.ndarray: ... @overload def func(array: ImplementsArrayFunction) -> Any: ... [1] https://www.python.org/dev/peps/pep-0544/ [2] https://github.com/python/typing/issues/253#issuecomment-389262904 -------------- next part -------------- An HTML attachment was scrubbed... 
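[A self-contained version of the typing sketch above; ``ImplementsArrayFunction`` is a name invented in that message, and ``Protocol`` assumes the typing_extensions backport (it moved into ``typing`` with Python 3.8).]

```
from typing import Any, overload
from typing_extensions import Protocol  # PEP 544
import numpy as np

class ImplementsArrayFunction(Protocol):
    # Name taken from the sketch above; purely illustrative.
    def __array_function__(self, func, types, args, kwargs) -> Any: ...

@overload
def func(array: np.ndarray) -> np.ndarray: ...
@overload
def func(array: ImplementsArrayFunction) -> Any: ...

def func(array):
    # Runtime implementation; the overloads above exist only for the type checker.
    return np.asarray(array)
```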
URL: From shoyer at gmail.com Tue Jun 5 14:49:00 2018 From: shoyer at gmail.com (Stephan Hoyer) Date: Tue, 5 Jun 2018 11:49:00 -0700 Subject: [Numpy-discussion] =?utf-8?q?NEP=3A_Dispatch_Mechanism_for_NumPy?= =?utf-8?q?=E2=80=99s_high_level_API?= In-Reply-To: References: Message-ID: On Mon, Jun 4, 2018 at 7:35 AM Marten van Kerkwijk < m.h.vankerkwijk at gmail.com> wrote: > Hi Stephan, > > Another potential consideration in favor of NotImplementedButCoercible is >> for subclassing: we could use it to write the default implementations of >> ndarray.__array_ufunc__ and ndarray.__array_function__, e.g., >> >> class ndarray: >> def __array_ufunc__(self, *args, **kwargs): >> return NotIImplementedButCoercible >> def __array_function__(self, *args, **kwargs): >> return NotIImplementedButCoercible >> >> I think (not 100% sure yet) this would result in exactly equivalent >> behavior to what ndarray.__array_ufunc__ currently does: >> >> http://www.numpy.org/neps/nep-0013-ufunc-overrides.html#subclass-hierarchies >> > > As written would not work for ndarray subclasses, because the subclass > will generically change itself before calling super. At least for Quantity, > say if I add two quantities, the quantities will both be converted to > arrays (with one scaled so that the units match) and then the super call is > done with those modified arrays. This expects that the super call will > actually return a result (which it now can because all inputs are arrays). > Thanks for clarifying. This is definitely trickier than I had thought. If Quantity.__array_ufunc__ implemented overrides by calling the public ufunc method again (instead of calling super), then it would still work fine with this change. But of course, in that case you would not need ndarray.__array_ufunc__ defined at all. I will say that personally, I find the complexity of the current ndarray.__array_ufunc__ implementation a little inelegant, and I would welcome simplifying it. But I also try to avoid implementation inheritance entirely [2], for exactly the same reasons why refactoring ndarray.__array_ufunc__ here would be difficult (inheritance is fragile). So I would be happy to defer to your judgment, as someone who actually uses subclassing. https://hackernoon.com/inheritance-based-on-internal-structure-is-evil-7474cc8e64dc -------------- next part -------------- An HTML attachment was scrubbed... URL: From m.h.vankerkwijk at gmail.com Tue Jun 5 15:33:40 2018 From: m.h.vankerkwijk at gmail.com (Marten van Kerkwijk) Date: Tue, 5 Jun 2018 15:33:40 -0400 Subject: [Numpy-discussion] =?utf-8?q?NEP=3A_Dispatch_Mechanism_for_NumPy?= =?utf-8?q?=E2=80=99s_high_level_API?= In-Reply-To: References: Message-ID: Hi Stephan, Things would, I think, make much more sense if `ndarray.__array_ufunc__` (or `*_function__`) actually *were* the implementation for array-only. But while that is something I'd like to eventually get to, it seems out of scope for the current discussion. But we should be sure that the ndarray versions return either `NotImplemented` or a result. Given that, I think that perhaps it is also best not to do `NotImplementedButCoercible` - as I think the implementers of `__array_function__` perhaps should just do that themselves. But I may well swing the other way again... Good examples of non-trivial benefits would help. All the best, Marten -------------- next part -------------- An HTML attachment was scrubbed... 
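[A sketch of what "doing it themselves" could look like for a duck array (the class and registry names are made up for illustration): unhandled functions coerce the class's own instances to ndarray and re-call the public function, which is the behavior ``NotImplementedButCoercible`` would otherwise provide automatically.]

```
import numpy as np

HANDLED = {}  # functions this class overrides; left empty in this sketch

class MyDuckArray:
    def __init__(self, data):
        self.data = np.asarray(data)

    def __array__(self, dtype=None):
        return np.asarray(self.data, dtype=dtype)

    def __array_function__(self, func, types, args, kwargs):
        if func in HANDLED:
            return HANDLED[func](*args, **kwargs)
        # Fallback: coerce our own instances and re-call the function.
        # Only positional arguments are handled, to keep the sketch short.
        coerced = [np.asarray(a) if isinstance(a, MyDuckArray) else a
                   for a in args]
        return func(*coerced, **kwargs)
```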
URL: From shoyer at gmail.com Tue Jun 5 17:11:23 2018 From: shoyer at gmail.com (Stephan Hoyer) Date: Tue, 5 Jun 2018 14:11:23 -0700 Subject: [Numpy-discussion] =?utf-8?q?NEP=3A_Dispatch_Mechanism_for_NumPy?= =?utf-8?q?=E2=80=99s_high_level_API?= In-Reply-To: References: Message-ID: On Tue, Jun 5, 2018 at 12:35 PM Marten van Kerkwijk < m.h.vankerkwijk at gmail.com> wrote: > Things would, I think, make much more sense if `ndarray.__array_ufunc__` > (or `*_function__`) actually *were* the implementation for array-only. But > while that is something I'd like to eventually get to, it seems out of > scope for the current discussion. > If this is a desirable end-state, we should at least consider it now while we are designing the __array_function__ interface. With the current proposal, I think this would be nearly impossible. The challenge is that ndarray.__array_function__ would somehow need to call the non-overloaded version of the provided function provided that no other arguments overload __array_function__. However, currently don't expose this information in any way. Some ways this could be done (including some of your prior suggestions): - Add a coerce=True argument to all NumPy functions, which could be used by non-overloaded implementations. - A separate namespace for non-overloaded functions (e.g., numpy.array_only). - Adding another argument to the __array_function__ interface to explicitly provide the non-overloaded implementation (e.g., func_impl). I don't like any of these options and I'm not sure I agree with your goal, but the NEP should make clear that we are precluding this possibility. Given that, I think that perhaps it is also best not to do > `NotImplementedButCoercible` - as I think the implementers of > `__array_function__` perhaps should just do that themselves. But I may well > swing the other way again... Good examples of non-trivial benefits would > help. > This would also be my default stance, and of course we can always add NotImplementedButCoercible later. I can think of two main use cases: 1. Libraries that only want to overload *some* NumPy functions, but want the rest of NumPy's API by coercing arguments to NumPy arrays. 2. Library that want to eventually overload all of NumPy's high level API, but need to do so incrementally, in a way that preserves backwards compatibility. I'm not sure I agree with use case 1. Arguably, libraries that only overload a limited part of NumPy's API shouldn't encourage their users their users to rely on it. This state of affairs is pretty confusing to users. However, case 2 is valid and potentially important. Consider the case of a library with existing users that would like to start implementing __array_function__ (e.g., dask, astropy, xarray, pandas). The right strategy really depends upon whether the library considers the current behavior of NumPy functions on their objects (silent coercion to numpy arrays) a feature or a bug: - If coercion is a bug and something that the library never intended to support, then perhaps it would be OK to suddenly change all existing overloads to return the correct type. - However, if coercion is a feature (which is probably the attitude of at least some users), ideally there really should be a graceful way to enable the new overloaded behavior incrementally. For example, a library might want to start issuing FutureWarning in version X, before switching over to the new overloaded behavior in version X+1. I can't think of how to do this without NotImplementedButCoercible. 
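[For use case 2, a hypothetical transition shim might look like the following, again assuming the ``NotImplementedButCoercible`` sentinel from this thread (which does not exist): warn in release X, switch to real overrides in release X+1.]

```
import warnings

# Hypothetical sentinel proposed in this thread.
NotImplementedButCoercible = object()

class LibraryArray:
    """Duck array migrating gradually to __array_function__ overrides."""

    _overloads = {}  # filled in incrementally with real implementations

    def __array_function__(self, func, types, args, kwargs):
        if func in self._overloads:
            return self._overloads[func](*args, **kwargs)
        warnings.warn(
            "numpy.{} currently coerces LibraryArray to ndarray; in the next "
            "release it will return a LibraryArray instead".format(func.__name__),
            FutureWarning, stacklevel=2)
        return NotImplementedButCoercible  # keep the old coercing behavior for now
```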
For projects like dask and xarray, the benefits of __array_function__ are so large that we will accept a hard transition that breaks some user code without warning. But this may not be the case for other projects. -------------- next part -------------- An HTML attachment was scrubbed... URL: From m.h.vankerkwijk at gmail.com Tue Jun 5 17:31:38 2018 From: m.h.vankerkwijk at gmail.com (Marten van Kerkwijk) Date: Tue, 5 Jun 2018 17:31:38 -0400 Subject: [Numpy-discussion] =?utf-8?q?NEP=3A_Dispatch_Mechanism_for_NumPy?= =?utf-8?q?=E2=80=99s_high_level_API?= In-Reply-To: References: Message-ID: Hi Stephan, On `NotImplementedButCoercible`: don't forget that even a preliminary implementation of `__array_function__` has always the choice of coercing its own instances to ndarray and re-calling the function; that is really no different from (though probably a bit slower than) what would happen if one returned NIBC. It does require, however, a fairly efficient way of finding arguments of one's own class, which is partially why I think it is important for there to be a quick way to find instances of one's own type; we should try to avoid having people to reimplement the dance. It may still be that `types` is the right vehicle for this - it just depends on how much of the state of the dance it carries. ? On the "separate" name-space question: one thing it is not is particularly difficult, especially if one works with a decorator: effectively one already has the original function and the wrapped one; the only question is whether it would pay to keep the original one around somewhere. I do continue to think that we will get grumbling about regressions in speed and that it would help to have the undecorated versions available. Though in my ideal world those would do no coercing whatsoever, but just take arrays, i.e., they are actually faster than the current ones. All the best, Marten -------------- next part -------------- An HTML attachment was scrubbed... URL: From matti.picus at gmail.com Tue Jun 5 17:43:10 2018 From: matti.picus at gmail.com (Matti Picus) Date: Tue, 5 Jun 2018 14:43:10 -0700 Subject: [Numpy-discussion] =?utf-8?q?NEP=3A_Dispatch_Mechanism_for_NumPy?= =?utf-8?q?=E2=80=99s_high_level_API?= In-Reply-To: References: Message-ID: On 05/06/18 14:11, Stephan Hoyer wrote: > On Tue, Jun 5, 2018 at 12:35 PM Marten van Kerkwijk > > wrote: > > Things would, I think, make much more sense if > `ndarray.__array_ufunc__` (or `*_function__`) actually *were* the > implementation for array-only. But while that is something I'd > like to eventually get to, it seems out of scope for the current > discussion. > > > If this is a desirable end-state, we should at least consider it now > while we are designing the __array_function__ interface. > > With the current proposal, I think this would be nearly impossible. > The challenge is that ndarray.__array_function__ would somehow need to > call the non-overloaded version of the provided function provided that > no other arguments overload __array_function__. However, currently > don't expose this information in any way. > > Some ways this could be done (including some of your prior suggestions): > - Add?a coerce=True argument to all NumPy functions, which could be > used by non-overloaded implementations. > - A separate namespace for non-overloaded functions (e.g., > numpy.array_only). > - Adding another argument to the __array_function__ interface to > explicitly provide the non-overloaded implementation (e.g., func_impl). 
> > I don't like any of these options and I'm not sure I agree with your > goal, but the NEP should make clear that we are precluding this > possibility. > What is the difference between the `func` provided as the first argument to `__array_function__` and `__array_ufunc__` and the "non-overloaded version of the provided function"? This NEP calls it an "arbitrary callable". In `__array_ufunc__` it turns out people count on it being exactly the `np.ufunc`. Matti From shoyer at gmail.com Tue Jun 5 18:03:32 2018 From: shoyer at gmail.com (Stephan Hoyer) Date: Tue, 5 Jun 2018 15:03:32 -0700 Subject: [Numpy-discussion] =?utf-8?q?NEP=3A_Dispatch_Mechanism_for_NumPy?= =?utf-8?q?=E2=80=99s_high_level_API?= In-Reply-To: References: Message-ID: On Tue, Jun 5, 2018 at 2:47 PM Matti Picus wrote: > What is the difference between the `func` provided as the first argument > to `__array_function__` and `__array_ufunc__` and the "non-overloaded > version of the provided function"? > The ""non-overloaded version of the provided function" is entirely hypothetical at this point. If we use a decorator to implement overloads, it would be the undecorated function, e.g., the original definition of concatenate here: @overload_for_array_function(['arrays', 'out'])def concatenate(arrays, axis=0, out=None): ... # continue with the definition of concatenate This NEP calls it an "arbitrary callable". > In `__array_ufunc__` it turns out people count on it being exactly the > `np.ufunc`. Right, I think this is good guarantee to provide. Certainly it's one that people fine useful. -------------- next part -------------- An HTML attachment was scrubbed... URL: From nelle.varoquaux at gmail.com Tue Jun 5 20:06:37 2018 From: nelle.varoquaux at gmail.com (Nelle Varoquaux) Date: Tue, 5 Jun 2018 17:06:37 -0700 Subject: [Numpy-discussion] 2018 John Hunter Excellence in Plotting Contest Message-ID: Hello everyone, Sorry about the cross-posting. There's a couple more days to submit to the John Hunter Excellence in Plotting Competition! If you have any scientific plot worth sharing, submit an entry before June 8th. For more information, see below. Thanks, Nelle In memory of John Hunter, we are pleased to be reviving the SciPy John Hunter Excellence in Plotting Competition for 2018. This open competition aims to highlight the importance of data visualization to scientific progress and showcase the capabilities of open source software. Participants are invited to submit scientific plots to be judged by a panel. The winning entries will be announced and displayed at the conference. John Hunter?s family and NumFocus are graciously sponsoring cash prizes for the winners in the following amounts: - 1st prize: $1000 - 2nd prize: $750 - 3rd prize: $500 - Entries must be submitted by June, 8th to the form at https://goo.gl/forms/7q86zgu5OYUOjODH3 . - Winners will be announced at Scipy 2018 in Austin, TX. - Participants do not need to attend the Scipy conference. - Entries may take the definition of ?visualization? rather broadly. Entries may be, for example, a traditional printed plot, an interactive visualization for the web, or an animation. - Source code for the plot must be provided, in the form of Python code and/or a Jupyter notebook, along with a rendering of the plot in a widely used format. This may be, for example, PDF for print, standalone HTML and Javascript for an interactive plot, or MPEG-4 for a video. 
If the original data can not be shared for reasons of size or licensing, "fake" data may be substituted, along with an image of the plot using real data. - Each entry must include a 300-500 word abstract describing the plot and its importance for a general scientific audience. - Entries will be judged on their clarity, innovation and aesthetics, but most importantly for their effectiveness in communicating a real-world problem. Entrants are encouraged to submit plots that were used during the course of research or work, rather than merely being hypothetical. - SciPy reserves the right to display any and all entries, whether prize-winning or not, at the conference, use in any materials or on its website, with attribution to the original author(s). SciPy John Hunter Excellence in Plotting Competition Co-Chairs Thomas Caswell Michael Droettboom Nelle Varoquaux -------------- next part -------------- An HTML attachment was scrubbed... URL: From m.h.vankerkwijk at gmail.com Tue Jun 5 20:32:49 2018 From: m.h.vankerkwijk at gmail.com (Marten van Kerkwijk) Date: Tue, 5 Jun 2018 20:32:49 -0400 Subject: [Numpy-discussion] =?utf-8?q?NEP=3A_Dispatch_Mechanism_for_NumPy?= =?utf-8?q?=E2=80=99s_high_level_API?= In-Reply-To: References: Message-ID: Yes, the function should definitely be the same as what the user called - i.e., the decorated function. I'm only wondering if it would also be possible to have access to the undecorated one (via `coerce` or `ndarray.__array_function__` or otherwise). -- Marten -------------- next part -------------- An HTML attachment was scrubbed... URL: From nathan12343 at gmail.com Tue Jun 5 20:39:33 2018 From: nathan12343 at gmail.com (Nathan Goldbaum) Date: Tue, 5 Jun 2018 19:39:33 -0500 Subject: [Numpy-discussion] =?utf-8?q?NEP=3A_Dispatch_Mechanism_for_NumPy?= =?utf-8?q?=E2=80=99s_high_level_API?= In-Reply-To: References: Message-ID: Hmm, does this mean the callable that gets passed into __array_ufunc__ will change? I'm pretty sure that will break the dispatch mechanism I'm using in my __array_ufunc__ implementation, which directly checks whether the callable is in one of several tuples of functions that have different behavior. On Tue, Jun 5, 2018 at 7:32 PM, Marten van Kerkwijk < m.h.vankerkwijk at gmail.com> wrote: > Yes, the function should definitely be the same as what the user called - > i.e., the decorated function. I'm only wondering if it would also be > possible to have access to the undecorated one (via `coerce` or > `ndarray.__array_function__` or otherwise). > -- Marten > > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at python.org > https://mail.python.org/mailman/listinfo/numpy-discussion > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From nathan12343 at gmail.com Tue Jun 5 20:41:25 2018 From: nathan12343 at gmail.com (Nathan Goldbaum) Date: Tue, 5 Jun 2018 19:41:25 -0500 Subject: [Numpy-discussion] =?utf-8?q?NEP=3A_Dispatch_Mechanism_for_NumPy?= =?utf-8?q?=E2=80=99s_high_level_API?= In-Reply-To: References: Message-ID: Oh wait, since the decorated version of the ufunc will be the one in the public numpy API it won't break. It would only break if the callable that was passed in *wasn't* the decorated version, so it kinda *has* to pass in the decorated function to preserve backward compatibility. Apologies for the noise. On Tue, Jun 5, 2018 at 7:39 PM, Nathan Goldbaum wrote: > Hmm, does this mean the callable that gets passed into __array_ufunc__ > will change? 
I'm pretty sure that will break the dispatch mechanism I'm > using in my __array_ufunc__ implementation, which directly checks whether > the callable is in one of several tuples of functions that have different > behavior. > > On Tue, Jun 5, 2018 at 7:32 PM, Marten van Kerkwijk < > m.h.vankerkwijk at gmail.com> wrote: > >> Yes, the function should definitely be the same as what the user called - >> i.e., the decorated function. I'm only wondering if it would also be >> possible to have access to the undecorated one (via `coerce` or >> `ndarray.__array_function__` or otherwise). >> -- Marten >> >> >> _______________________________________________ >> NumPy-Discussion mailing list >> NumPy-Discussion at python.org >> https://mail.python.org/mailman/listinfo/numpy-discussion >> >> > -------------- next part -------------- An HTML attachment was scrubbed... URL: From einstein.edison at gmail.com Wed Jun 6 06:20:03 2018 From: einstein.edison at gmail.com (Hameer Abbasi) Date: Wed, 6 Jun 2018 12:20:03 +0200 Subject: [Numpy-discussion] =?utf-8?q?NEP=3A_Dispatch_Mechanism_for_NumPy?= =?utf-8?q?=E2=80=99s_high_level_API?= In-Reply-To: References: Message-ID: On 6. Jun 2018 at 05:41, Nathan Goldbaum wrote: Oh wait, since the decorated version of the ufunc will be the one in the public numpy API it won't break. It would only break if the callable that was passed in *wasn't* the decorated version, so it kinda *has* to pass in the decorated function to preserve backward compatibility. Apologies for the noise. On Tue, Jun 5, 2018 at 7:39 PM, Nathan Goldbaum wrote: > Hmm, does this mean the callable that gets passed into __array_ufunc__ > will change? I'm pretty sure that will break the dispatch mechanism I'm > using in my __array_ufunc__ implementation, which directly checks whether > the callable is in one of several tuples of functions that have different > behavior. > Section ?Non-Goals? states that Ufuncs will not be part of this protocol, __array_ufunc__ will be used to override those as usual. Sent from Astro for Mac -------------- next part -------------- An HTML attachment was scrubbed... URL: From charlesr.harris at gmail.com Wed Jun 6 14:06:48 2018 From: charlesr.harris at gmail.com (Charles R Harris) Date: Wed, 6 Jun 2018 12:06:48 -0600 Subject: [Numpy-discussion] NumPy 1.14.4 released. Message-ID: Hi All, On behalf of the NumPy team, I am pleased to announce the release of NumPy 1.14.4. This is a bugfix release for bugs reported following the 1.14.3 release. The most significant fixes are: * fixes for compiler instruction reordering that resulted in NaN's not being properly propagated in `np.max` and `np.min`, * fixes for bus faults on SPARC and older ARM due to incorrect alignment checks. There are also improvements to printing of long doubles on PPC platforms. All is not yet perfect on that platform, the whitespace padding is still incorrect and is to be fixed in numpy 1.15, consequently NumPy still fails some printing-related (and other) unit tests on ppc systems. However, the printed values are now correct. Note that NumPy will error on import if it detects incorrect float32 `dot` results. This problem has been seen on the Mac when working in the Anaconda enviroment and is due to a subtle interaction between MKL and PyQt5. It is not strictly a NumPy problem, but it is best that users be aware of it. See the gh-8577 NumPy issue for more information. The Python versions supported in this release are 2.7 and 3.4 - 3.6. 
Wheels for all supported versions are available from PIP and source releases are available on github . The source releases were cythonized with Cython 0.28.2 and should be compatible with the upcoming Python 3.7. Contributors ============ A total of 7 people contributed to this release. People with a "+" by their names contributed a patch for the first time. * Allan Haldane * Charles Harris * Marten van Kerkwijk * Matti Picus * Pauli Virtanen * Ryan Soklaski + * Sebastian Berg Pull requests merged ==================== A total of 11 pull requests were merged for this release. * #11104: BUG: str of DOUBLE_DOUBLE format wrong on ppc64 * #11170: TST: linalg: add regression test for gh-8577 * #11174: MAINT: add sanity-checks to be run at import time * #11181: BUG: void dtype setup checked offset not actual pointer for alignment * #11194: BUG: Python2 doubles don't print correctly in interactive shell. * #11198: BUG: optimizing compilers can reorder call to npy_get_floatstatus * #11199: BUG: reduce using SSE only warns if inside SSE loop * #11203: BUG: Bytes delimiter/comments in genfromtxt should be decoded Cheers, Chuck -------------- next part -------------- An HTML attachment was scrubbed... URL: From m.h.vankerkwijk at gmail.com Fri Jun 8 11:57:18 2018 From: m.h.vankerkwijk at gmail.com (Marten van Kerkwijk) Date: Fri, 8 Jun 2018 11:57:18 -0400 Subject: [Numpy-discussion] =?utf-8?q?NEP=3A_Dispatch_Mechanism_for_NumPy?= =?utf-8?q?=E2=80=99s_high_level_API?= In-Reply-To: References: Message-ID: Hi Stephan, I think we're getting to the stage where an updated text would be useful. For that, you may want to consider an actual implementation of, e.g., a very simple function like `np.reshape` as well as a more complicated one like `np.concatenate`, and in particular how the implementation finds out where its own instances are located. ?All the best, Marten -------------- next part -------------- An HTML attachment was scrubbed... URL: From shoyer at gmail.com Fri Jun 8 12:39:49 2018 From: shoyer at gmail.com (Stephan Hoyer) Date: Fri, 8 Jun 2018 09:39:49 -0700 Subject: [Numpy-discussion] =?utf-8?q?NEP=3A_Dispatch_Mechanism_for_NumPy?= =?utf-8?q?=E2=80=99s_high_level_API?= In-Reply-To: References: Message-ID: On Fri, Jun 8, 2018 at 8:58 AM Marten van Kerkwijk < m.h.vankerkwijk at gmail.com> wrote: > I think we're getting to the stage where an updated text would be useful. > Yes, I plan to work on this over the weekend. Stay tuned! > For that, you may want to consider an actual implementation of, e.g., a > very simple function like `np.reshape` as well as a more complicated one > like `np.concatenate` > Yes, I agree that actual implementation (in Python rather than C for now) would be useful. > and in particular how the implementation finds out where its own instances > are located. > I think we've discussed this before, but I don't think this is feasible to solve in general given the diversity of wrapped APIs. If you want to find the arguments in which a class' own instances appear, you will need to do that in your overloaded function. That said, if merely pulling out the flat list of arguments that are checked for and/or implement __array_function__ would be enough, we can probably figure out a way to expose that information. -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From m.h.vankerkwijk at gmail.com Fri Jun 8 19:49:13 2018 From: m.h.vankerkwijk at gmail.com (Marten van Kerkwijk) Date: Fri, 8 Jun 2018 19:49:13 -0400 Subject: [Numpy-discussion] =?utf-8?q?NEP=3A_Dispatch_Mechanism_for_NumPy?= =?utf-8?q?=E2=80=99s_high_level_API?= In-Reply-To: References: Message-ID: > and in particular how the implementation finds out where its own instances >> are located. >> > > I think we've discussed this before, but I don't think this is feasible to > solve in general given the diversity of wrapped APIs. If you want to find > the arguments in which a class' own instances appear, you will need to do > that in your overloaded function. > > That said, if merely pulling out the flat list of arguments that are > checked for and/or implement __array_function__ would be enough, we can > probably figure out a way to expose that information. > In the end, somewhere inside the "dance", you are checking for `__array_function` - it would seem to me that at that point you know exactly where you are, and it would not be difficult to something like ``` types[new_type] += [where_i_am] ``` (where here I assume types is a defaultdict(list)) - has the set of types in keys and locations as values. But easier to discuss whether this is easy with some sample code to look at! -- Marten -------------- next part -------------- An HTML attachment was scrubbed... URL: From shoyer at gmail.com Fri Jun 8 20:10:13 2018 From: shoyer at gmail.com (Stephan Hoyer) Date: Fri, 8 Jun 2018 17:10:13 -0700 Subject: [Numpy-discussion] =?utf-8?q?NEP=3A_Dispatch_Mechanism_for_NumPy?= =?utf-8?q?=E2=80=99s_high_level_API?= In-Reply-To: References: Message-ID: (offlist) To clarify, by "where_i_am" you mean something like the name of the argument where it was found? On Fri, Jun 8, 2018 at 4:49 PM Marten van Kerkwijk < m.h.vankerkwijk at gmail.com> wrote: > and in particular how the implementation finds out where its own instances >>> are located. >>> >> >> I think we've discussed this before, but I don't think this is feasible >> to solve in general given the diversity of wrapped APIs. If you want to >> find the arguments in which a class' own instances appear, you will need to >> do that in your overloaded function. >> >> That said, if merely pulling out the flat list of arguments that are >> checked for and/or implement __array_function__ would be enough, we can >> probably figure out a way to expose that information. >> > > In the end, somewhere inside the "dance", you are checking for > `__array_function` - it would seem to me that at that point you know > exactly where you are, and it would not be difficult to something like > ``` > types[new_type] += [where_i_am] > ``` > (where here I assume types is a defaultdict(list)) - has the set of types > in keys and locations as values. > > But easier to discuss whether this is easy with some sample code to look > at! > > -- Marten > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at python.org > https://mail.python.org/mailman/listinfo/numpy-discussion > -------------- next part -------------- An HTML attachment was scrubbed... 
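[One possible shape for that information, expanding the ``defaultdict(list)`` snippet above into a standalone sketch; this is not what any NumPy implementation does, and the argument flattening is assumed to have happened already.]

```
from collections import defaultdict

def collect_overloaded_types(relevant_args):
    """Map each type that defines __array_function__ to the positions at
    which its instances occur in the (already flattened) argument list."""
    types = defaultdict(list)
    for position, arg in enumerate(relevant_args):
        if hasattr(type(arg), '__array_function__'):
            types[type(arg)].append(position)
    return types
```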
URL: From m.h.vankerkwijk at gmail.com Fri Jun 8 21:51:07 2018 From: m.h.vankerkwijk at gmail.com (Marten van Kerkwijk) Date: Fri, 8 Jun 2018 21:51:07 -0400 Subject: [Numpy-discussion] =?utf-8?q?NEP=3A_Dispatch_Mechanism_for_NumPy?= =?utf-8?q?=E2=80=99s_high_level_API?= In-Reply-To: References: Message-ID: I meant whatever the state of the dance routine is, e.g., the way the arguments are enumerated by the decorator ?(this is partially why some example code for the dance routine is needed -- I am not 100% how this should work, just seems logical that if the dance routine can understand it, so can __array_function__ implementations). -- Marten -------------- next part -------------- An HTML attachment was scrubbed... URL: From m.h.vankerkwijk at gmail.com Sun Jun 10 12:27:32 2018 From: m.h.vankerkwijk at gmail.com (Marten van Kerkwijk) Date: Sun, 10 Jun 2018 12:27:32 -0400 Subject: [Numpy-discussion] Allowing broadcasting of code dimensions in generalized ufuncs In-Reply-To: References: Message-ID: OK, I spent my Sunday morning writing a NEP. I hope this can lead to some closure... See https://github.com/numpy/numpy/pull/11297 -- Marten -------------- next part -------------- An HTML attachment was scrubbed... URL: From wieser.eric+numpy at gmail.com Sun Jun 10 19:02:35 2018 From: wieser.eric+numpy at gmail.com (Eric Wieser) Date: Sun, 10 Jun 2018 16:02:35 -0700 Subject: [Numpy-discussion] Allowing broadcasting of code dimensions in generalized ufuncs In-Reply-To: References: Message-ID: Rendered here: https://github.com/mhvk/numpy/blob/nep-gufunc-signature-enhancement/doc/neps/nep-0020-gufunc-signature-enhancement.rst Eric On Sun, 10 Jun 2018 at 09:37 Marten van Kerkwijk wrote: > OK, I spent my Sunday morning writing a NEP. I hope this can lead to some > closure... > See https://github.com/numpy/numpy/pull/11297 > -- Marten > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at python.org > https://mail.python.org/mailman/listinfo/numpy-discussion > -------------- next part -------------- An HTML attachment was scrubbed... URL: From wieser.eric+numpy at gmail.com Sun Jun 10 19:31:41 2018 From: wieser.eric+numpy at gmail.com (Eric Wieser) Date: Sun, 10 Jun 2018 16:31:41 -0700 Subject: [Numpy-discussion] Allowing broadcasting of code dimensions in generalized ufuncs In-Reply-To: References: Message-ID: Thanks for the writeup Marten, Nathaniel: Output shape feels very similar to output dtype to me, so maybe the general way to handle this would be to make the first callback take the input shapes+dtypes and return the desired output shapes+dtypes? This hits on an interesting alternative to frozen dimensions - np.cross could just become a regular ufunc with signature np.dtype((float64, 3)), np.dtype((float64, 3)) → np.dtype((float64, 3)) Furthermore, the expansion quickly becomes cumbersome. For instance, for the all_equal signature of (n|1),(n|1)->() ? I think this is only a good argument when used in conjunction with the broadcasting syntax. I don?t think it?s a reason for matmul not to have multiple signatures. Having multiple signatures is an disincentive to introduced too many overloads of the same function, which seems like a good thing to me Summarizing my overall opinions: - I?m +0.5 on frozen dimensions. The use-cases seem reasonable, and it seems like an easy-ish way to get them. 
Allowing ufuncs to natively support subarray types might be a tidier solution, but that could come down the road - I?m -1 on optional dimensions: they seem to legitimize creating many overloads of gufuncs. I?m already not a fan of how matmul has special cases for lower dimensions that don?t generalize well. To me, the best way to handle matmul would be to use the proposed __array_function__ to handle the shape-based special-case dispatching, either by: - Inserting dimensions, and calling the true gufunc np.linalg.matmul_2d (which is a function I?d like direct access to anyway). - Dispatching to one of four ufuncs - Broadcasting dimensions: - I know you?re not suggesting this but: enabling broadcasting unconditionally for all gufuncs would be a bad idea, masking linalg bugs. (although einsum does support broadcasting?) - Does it really need a per-dimension flag, rather than a global one? Can you give a case where that?s useful? - If we?d already made all_equal a gufunc, I?d be +1 on adding broadcasting support to it - I?m -0.5 on the all_equal path in the first place. I think we either should have a more generic approach to combined ufuncs, or just declare them numbas job. - Can you come up with a broadcasting use-case that isn?t just chaining a reduction with a broadcasting ufunc? Eric On Sun, 10 Jun 2018 at 16:02 Eric Wieser wrote: Rendered here: > https://github.com/mhvk/numpy/blob/nep-gufunc-signature-enhancement/doc/neps/nep-0020-gufunc-signature-enhancement.rst > > > Eric > > On Sun, 10 Jun 2018 at 09:37 Marten van Kerkwijk < > m.h.vankerkwijk at gmail.com> wrote: > >> OK, I spent my Sunday morning writing a NEP. I hope this can lead to some >> closure... >> See https://github.com/numpy/numpy/pull/11297 >> -- Marten >> _______________________________________________ >> NumPy-Discussion mailing list >> NumPy-Discussion at python.org >> https://mail.python.org/mailman/listinfo/numpy-discussion >> > ? -------------- next part -------------- An HTML attachment was scrubbed... URL: From ralf.gommers at gmail.com Sun Jun 10 20:26:35 2018 From: ralf.gommers at gmail.com (Ralf Gommers) Date: Sun, 10 Jun 2018 17:26:35 -0700 Subject: [Numpy-discussion] NEP: Random Number Generator Policy In-Reply-To: References: Message-ID: On Mon, Jun 4, 2018 at 3:18 PM, Robert Kern wrote: > On Sun, Jun 3, 2018 at 8:22 PM Ralf Gommers > wrote: > >> It may be worth having a look at test suites for scipy, statsmodels, >> scikit-learn, etc. and estimate how much work this NEP causes those >> projects. If the devs of those packages are forced to do large scale >> migrations from RandomState to StableState, then why not instead keep >> RandomState and just add a new API next to it? >> > > The problem is that we can't really have an ecosystem with two different > general purpose systems. > Can't = prefer not to. But yes, that's true. That's not what I was saying though. We want one generic one, and one meant for unit testing only. You can achieve that in two ways: 1. Change the current np.random API to new generic, and add a new RandomStable for unit tests. 2. Add a new generic API, and document the current np.random API as being meant for unit tests only, for other usage should be preferred. (2) has a couple of pros: - you're not forcing almost every library and end user out there to migrate their unit tests. - more design freedom for the new generic API. The current one is clearly sub-optimal; in a new one you wouldn't have to expose all the global state/functions that np.random exposes now. 
You could even restrict it to a single class and put that in the main numpy namespace. Ralf To properly use pseudorandom numbers, I need to instantiate a PRNG and > thread it through all of the code in my program: both the parts that I > write and the third party libraries that I don't write. > > Generating test data for unit tests is separable, though. That's why I > propose having a StableRandom built on the new architecture. Its purpose > would be well-documented, and in my proposal is limited in features such > that it will be less likely to be abused outside of that purpose. If you > make it fully-featured, it is more likely to be abused by building library > code around it. But even if it is so abused, because it is built on the new > architecture, at least I can thread the same core PRNG state through the > StableRandom distributions from the abusing library and use the better > distributions class elsewhere (randomgen names it "Generator"). Just > keeping RandomState around can't work like that because it doesn't have a > replaceable core PRNG. > > But that does suggest another alternative that we should explore: > > The new architecture separates the core uniform PRNG from the wide variety > of non-uniform probability distributions. That is, the core PRNG state is > encapsulated in a discrete object that can be shared between instances of > different distribution-providing classes. numpy.random should provide two > such distribution-providing classes. The main one (let us call it > ``Generator``, as it is called in the prototype) will follow the new > policy: distribution methods can break the stream in feature releases. > There will also be a secondary distributions class (let us call it > ``LegacyGenerator``) which contains distribution methods exactly as they > exist in the current ``RandomState`` implementation. When one combines > ``LegacyGenerator`` with the MT19937 core PRNG, it should reproduce the > exact same stream as ``RandomState`` for all distribution methods. The > ``LegacyGenerator`` methods will be forever frozen. > ``numpy.random.RandomState()`` will instantiate a ``LegacyGenerator`` with > the MT19937 core PRNG, and whatever tricks needed to make > ``isinstance(prng, RandomState)`` and unpickling work should be done. This > way of creating the ``LegacyGenerator`` by way of ``RandomState`` will be > deprecated, becoming progressively noisier over a number of release cycles, > in favor of explicitly instantiating ``LegacyGenerator``. > > ``LegacyGenerator`` CAN be used during this deprecation period in library > and application code until libraries and applications can migrate to the > new ``Generator``. Libraries and applications SHOULD migrate but MUST NOT > be forced to. ``LegacyGenerator`` CAN be used to generate test data for > unit tests where cross-release stability of the streams is important. Test > writers SHOULD consider ways to mitigate their reliance on such stability > and SHOULD limit their usage to distribution methods that have fewer > cross-platform stability risks. > > -- > Robert Kern > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at python.org > https://mail.python.org/mailman/listinfo/numpy-discussion > > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From ralf.gommers at gmail.com Sun Jun 10 20:46:36 2018 From: ralf.gommers at gmail.com (Ralf Gommers) Date: Sun, 10 Jun 2018 17:46:36 -0700 Subject: [Numpy-discussion] NEP: Random Number Generator Policy In-Reply-To: References: Message-ID: On Sun, Jun 3, 2018 at 9:23 PM, Warren Weckesser wrote: > > > On Sun, Jun 3, 2018 at 11:20 PM, Ralf Gommers > wrote: > >> >> >> On Sun, Jun 3, 2018 at 6:54 PM, wrote: >> >>> >>> >>> On Sun, Jun 3, 2018 at 9:08 PM, Robert Kern >>> wrote: >>> >>>> On Sun, Jun 3, 2018 at 5:46 PM wrote: >>>> >>>>> >>>>> >>>>> On Sun, Jun 3, 2018 at 8:21 PM, Robert Kern >>>>> wrote: >>>>> >>>>>> >>>>>> The list of ``StableRandom`` methods should be chosen to support unit >>>>>>> tests: >>>>>>> >>>>>>> * ``.randint()`` >>>>>>> * ``.uniform()`` >>>>>>> * ``.normal()`` >>>>>>> * ``.standard_normal()`` >>>>>>> * ``.choice()`` >>>>>>> * ``.shuffle()`` >>>>>>> * ``.permutation()`` >>>>>>> >>>>>> >>>>>> https://github.com/numpy/numpy/pull/11229#discussion_r192604311 >>>>>> @bashtage writes: >>>>>> > standard_gamma and standard_exponential are important enough to be >>>>>> included here IMO. >>>>>> >>>>>> "Importance" was not my criterion, only whether they are used in unit >>>>>> test suites. This list was just off the top of my head for methods that I >>>>>> think were actually used in test suites, so I'd be happy to be shown live >>>>>> tests that use other methods. I'd like to be a *little* conservative about >>>>>> what methods we stick in here, but we don't have to be *too* conservative, >>>>>> since we are explicitly never going to be modifying these. >>>>>> >>>>> >>>>> That's one area where I thought the selection is too narrow. >>>>> We should be able to get a stable stream from the uniform for some >>>>> distributions. >>>>> >>>>> However, according to the Wikipedia description Poisson doesn't look >>>>> easy. I just wrote a unit test for statsmodels using Poisson random numbers >>>>> with hard coded numbers for the regression tests. >>>>> >>>> >>>> I'd really rather people do this than use StableRandom; this is best >>>> practice, as I see it, if your tests involve making precise comparisons to >>>> expected results. >>>> >>> >>> I hardcoded the results not the random data. So the unit tests rely on a >>> reproducible stream of Poisson random numbers. >>> I don't want to save 500 (100 or 1000) observations in a csv file for >>> every variation of the unit test that I run. >>> >> >> I agree, hardcoding numbers in every place where seeded random numbers >> are now used is quite unrealistic. >> >> It may be worth having a look at test suites for scipy, statsmodels, >> scikit-learn, etc. and estimate how much work this NEP causes those >> projects. If the devs of those packages are forced to do large scale >> migrations from RandomState to StableState, then why not instead keep >> RandomState and just add a new API next to it? >> >> > > As a quick and imperfect test, I monkey-patched numpy so that a call to > numpy.random.seed(m) actually uses m+1000 as the seed. 
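[For reference, the monkey-patch described above fits in a few lines; this is a rough sketch, and the exact way it was hooked into the test run is not shown in the thread.]

```
import numpy as np

_original_seed = np.random.seed

def _shifted_seed(seed=None):
    # Shift every explicit integer seed so that tests relying on the exact
    # stream (rather than on statistical properties) will fail.
    _original_seed(seed if seed is None else seed + 1000)

np.random.seed = _shifted_seed
```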
I ran the tests > using the `runtests.py` script: > > *seed+1000, using 'python runtests.py -n' in the source directory:* > > 236 failed, 12881 passed, 1248 skipped, 585 deselected, 84 xfailed, 7 > xpassed > > > Most of the failures are in scipy.stats: > > *seed+1000, using 'python runtests.py -n -s stats' in the source > directory:* > > 203 failed, 1034 passed, 4 skipped, 370 deselected, 4 xfailed, 1 xpassed > > > Changing the amount added to the seed or running the tests using the > function `scipy.test("full")` gives different (but similar magnitude) > results: > > *seed+1000, using 'import scipy; scipy.test("full")' in an ipython shell:* > > 269 failed, 13359 passed, 1271 skipped, 134 xfailed, 8 xpassed > > *seed+1, using 'python runtests.py -n' in the source directory:* > > 305 failed, 12812 passed, 1248 skipped, 585 deselected, 84 xfailed, 7 > xpassed > > > I suspect many of the tests will be easy to update, so fixing 300 or so > tests does not seem like a monumental task. > It's all not monumental, but it adds up quickly. In addition to changing tests, one will also need compatibility code when supporting multiple numpy versions (e.g. scipy when get a copy of RandomStable in scipy/_lib/_numpy_compat.py). A quick count of just np.random.seed occurrences with ``$ grep -roh --include \*.py np.random.seed . | wc -w`` for some packages: numpy: 77 scipy: 462 matplotlib: 204 statsmodels: 461 pymc3: 36 scikit-image: 63 scikit-learn: 69 keras: 46 pytorch: 0 tensorflow: 368 astropy: 24 And note, these are *not* incorrect/broken usages, this is code that works and has done so for years. Conclusion: the current proposal will cause work for the vast majority of libraries that depends on numpy. The total amount of that work will certainly not be counted in person-days/weeks, and more likely in years than months. So I'm not convinced yet that the current proposal is the best way forward. Ralf I haven't looked into why there are 585 deselected tests; maybe there are > many more tests lurking there that will have to be updated. > > Warren > > > > Ralf >> >> >> >>> >>> >>>> >>>> StableRandom is intended as a crutch so that the pain of moving >>>> existing unit tests away from the deprecated RandomState is less onerous. >>>> I'd really rather people write better unit tests! >>>> >>>> In particular, I do not want to add any of the integer-domain >>>> distributions (aside from shuffle/permutation/choice) as these are the ones >>>> that have the platform-dependency issues with respect to 32/64-bit `long` >>>> integers. They'd be unreliable for unit tests even if we kept them stable >>>> over time. >>>> >>>> >>>>> I'm not sure which other distributions are common enough and not >>>>> easily reproducible by transformation. E.g. negative binomial can be >>>>> reproduces by a gamma-poisson mixture. >>>>> >>>>> On the other hand normal can be easily recreated from standard_normal. >>>>> >>>> >>>> I was mostly motivated by making it a bit easier to mechanically >>>> replace uses of randn(), which is probably even more common than normal() >>>> and standard_normal() in unit tests. >>>> >>>> >>>>> Would it be difficult to keep this list large, given that it should be >>>>> frozen, low maintenance code ? >>>>> >>>> >>>> I admit that I had in mind non-statistical unit tests. That is, tests >>>> that didn't depend on the precise distribution of the inputs. >>>> >>> >>> The problem is that the unit test in `stats` rely on precise inputs (up >>> to some numerical noise). 
>>> For example p-values themselves are uniformly distributed if the >>> hypothesis test works correctly. That mean if I don't have control over the >>> inputs, then my p-value could be anything in (0, 1). So either we need a >>> real dataset, save all the random numbers in a file or have a reproducible >>> set of random numbers. >>> >>> 95% of the unit tests that I write are for statistics. A large fraction >>> of them don't rely on the exact distribution, but do rely on a random >>> numbers that are "good enough". >>> For example, when writing unit test, then I get every once in a while or >>> sometimes more often a "bad" stream of random numbers, for which >>> convergence might fail or where the estimated numbers are far away from the >>> true numbers, so test tolerance would have to be very high. >>> If I pick one of the seeds that looks good, then I can have tighter unit >>> test tolerance to insure results are good in a nice case. >>> >>> The problem is that we cannot write robust unit tests for regression >>> tests without stable inputs. >>> E.g. I verified my results with a Monte Carlo with 5000 replications and >>> 1000 Poisson observations in each. >>> Results look close to expected and won't depend much on the exact stream >>> of random variables. >>> But the Monte Carlo for each variant of the test took about 40 seconds. >>> Doing this for all option combination and dataset specification takes too >>> long to be feasible in a unit test suite. >>> So I rely on numpy's stable random numbers and hard code the results for >>> a specific random sample in the regression unit tests. >>> >>> Josef >>> >>> >>> >>>> >>>> -- >>>> Robert Kern >>>> >>>> _______________________________________________ >>>> NumPy-Discussion mailing list >>>> NumPy-Discussion at python.org >>>> https://mail.python.org/mailman/listinfo/numpy-discussion >>>> >>>> >>> >>> _______________________________________________ >>> NumPy-Discussion mailing list >>> NumPy-Discussion at python.org >>> https://mail.python.org/mailman/listinfo/numpy-discussion >>> >>> >> >> _______________________________________________ >> NumPy-Discussion mailing list >> NumPy-Discussion at python.org >> https://mail.python.org/mailman/listinfo/numpy-discussion >> >> > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at python.org > https://mail.python.org/mailman/listinfo/numpy-discussion > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From shoyer at gmail.com Sun Jun 10 20:52:50 2018 From: shoyer at gmail.com (Stephan Hoyer) Date: Sun, 10 Jun 2018 17:52:50 -0700 Subject: [Numpy-discussion] Allowing broadcasting of code dimensions in generalized ufuncs In-Reply-To: References: Message-ID: In Sun, Jun 10, 2018 at 4:31 PM Eric Wieser wrote: > Thanks for the writeup Marten, > Indeed, thank you Marten! > This hits on an interesting alternative to frozen dimensions - np.cross > could just become a regular ufunc with signature np.dtype((float64, 3)), > np.dtype((float64, 3)) → np.dtype((float64, 3)) > > Another alternative to mention is returning multiple arrays, e.g., two arrays for a fixed dimension of size 2. That said, I still think frozen dimension are a better proposal than either of these. > - I?m -1 on optional dimensions: they seem to legitimize creating many > overloads of gufuncs. I?m already not a fan of how matmul has special cases > for lower dimensions that don?t generalize well. 
To me, the best way to > handle matmul would be to use the proposed __array_function__ to > handle the shape-based special-case dispatching, either by: > - Inserting dimensions, and calling the true gufunc > np.linalg.matmul_2d (which is a function I?d like direct access to > anyway). > - Dispatching to one of four ufuncs > > I don't understand your alternative here. If we overload np.matmul using __array_function__, then it would not use *ether* of these options for writing the operation in terms of other gufuncs. It would simply look for an __array_function__ attribute, and call that method instead. My concern with either inserting dimensions or dispatching to one of four ufuncs is that some objects (e.g., xarray.DataArray) define matrix multiplication, but in an incompatible way with NumPy (e.g., xarray sums over axes with the same name, instead of last / second-to-last axes). NumPy really ought to provide a way overload the either operation, without either inserting/removing dummy dimensions or inspecting input shapes to dispatch to other gufuncs. That said, if you don't want to make np.matmul a gufunc, then I would much rather use Python's standard overloading rules with __matmul__/__rmatmul__ than use __array_function__, for two reasons: 1. You *already* need to use __matmul__/__rmatmul__ if you want to support matrix multiplication with @ on your class, so __array_function__ would be additional and redundant. __array_function__ is really intended as a fall-back, for cases where there is no other alternative. 2. With the current __array_function__ proposal, this would imply that calling other unimplemented NumPy functions on your object would raise TypeError rather than doing coercion. This sort of additional coupled behavior is probably not what an implementor of operator.matmul/@ is looking for. In summary, I would either support: 1. (This proposal) Adding additional optional dimensions to gufuncs for np.matmul/operator.matmul, or 2. Making operator.matmul a special case for mathematical operators that always checks overloads with __matmul__/__rmatmul__ even if __array_ufunc__ is defined. Either way, matrix-multiplication becomes somewhat of a special case. It's just a matter of whether it's a special case for gufuncs (using optional dimensions) or a special case for arithmetic overloads in NumPy (not using __array_ufunc__). Given that I think optional dimensions have other conceivable uses in gufuncs (for row/column vectors), I think that's the better option. I would not support either expand dimensions or dispatch to multiple gufuncs in NumPy's implementation of operator.matmul (i.e., ndarray.__matmul__). We could potentially only do this for numpy.matmul rather than operator.matmul/@, but that opens the door to potential inconsistency between the NumPy version of an operator and Python's version of an operator, which is something we tried very hard to avoid with __arary_ufunc__. -------------- next part -------------- An HTML attachment was scrubbed... URL: From robert.kern at gmail.com Sun Jun 10 20:57:24 2018 From: robert.kern at gmail.com (Robert Kern) Date: Sun, 10 Jun 2018 17:57:24 -0700 Subject: [Numpy-discussion] NEP: Random Number Generator Policy In-Reply-To: References: Message-ID: On Sun, Jun 10, 2018 at 5:47 PM Ralf Gommers wrote: > > On Sun, Jun 3, 2018 at 9:23 PM, Warren Weckesser < warren.weckesser at gmail.com> wrote: >> I suspect many of the tests will be easy to update, so fixing 300 or so tests does not seem like a monumental task. 
> > It's all not monumental, but it adds up quickly. In addition to changing tests, one will also need compatibility code when supporting multiple numpy versions (e.g. scipy when get a copy of RandomStable in scipy/_lib/_numpy_compat.py). > > A quick count of just np.random.seed occurrences with ``$ grep -roh --include \*.py np.random.seed . | wc -w`` for some packages: > numpy: 77 > scipy: 462 > matplotlib: 204 > statsmodels: 461 > pymc3: 36 > scikit-image: 63 > scikit-learn: 69 > keras: 46 > pytorch: 0 > tensorflow: 368 > astropy: 24 > > And note, these are *not* incorrect/broken usages, this is code that works and has done so for years. Yes, some of them are incorrect and broken. Failure can be difficult to detect. This module from keras is particularly problematic: https://github.com/keras-team/keras-preprocessing/blob/master/keras_preprocessing/image.py > Conclusion: the current proposal will cause work for the vast majority of libraries that depends on numpy. The total amount of that work will certainly not be counted in person-days/weeks, and more likely in years than months. So I'm not convinced yet that the current proposal is the best way forward. The mere usage of np.random.seed() doesn't imply that these packages actually require stream-compatibility. Some might, for sure, like where they are used in the unit tests, but that's not what you counted. At best, these numbers just mean that we can't eliminate np.random.seed() in a new system right away. -- Robert Kern -------------- next part -------------- An HTML attachment was scrubbed... URL: From robert.kern at gmail.com Sun Jun 10 21:08:47 2018 From: robert.kern at gmail.com (Robert Kern) Date: Sun, 10 Jun 2018 18:08:47 -0700 Subject: [Numpy-discussion] NEP: Random Number Generator Policy In-Reply-To: References: Message-ID: On Sun, Jun 10, 2018 at 5:27 PM Ralf Gommers wrote: > > On Mon, Jun 4, 2018 at 3:18 PM, Robert Kern wrote: >> >> On Sun, Jun 3, 2018 at 8:22 PM Ralf Gommers wrote: >>> >>> It may be worth having a look at test suites for scipy, statsmodels, scikit-learn, etc. and estimate how much work this NEP causes those projects. If the devs of those packages are forced to do large scale migrations from RandomState to StableState, then why not instead keep RandomState and just add a new API next to it? >> >> The problem is that we can't really have an ecosystem with two different general purpose systems. > > Can't = prefer not to. I meant what I wrote. :-) > But yes, that's true. That's not what I was saying though. We want one generic one, and one meant for unit testing only. You can achieve that in two ways: > 1. Change the current np.random API to new generic, and add a new RandomStable for unit tests. > 2. Add a new generic API, and document the current np.random API as being meant for unit tests only, for other usage should be preferred. > > (2) has a couple of pros: > - you're not forcing almost every library and end user out there to migrate their unit tests. But it has the cons that I talked about. RandomState *is* a fully functional general purpose PRNG system. After all, that's its current use. Documenting it as intended to be something else will not change that fact. Documentation alone provides no real impetus to move to the new system outside of the unit tests. 
And the community does need to move together to the new system in their library code, or else we won't be able to combine libraries together; these PRNG objects need to thread all the way through between code from different authors if we are to write programs with a controlled seed. The failure mode when people don't pay attention to the documentation is that I can no longer write programs that compose these libraries together. That's why I wrote "can't". It's not a mere preference for not having two systems to maintain. It has binary Go/No Go implications for building reproducible programs. > - more design freedom for the new generic API. The current one is clearly sub-optimal; in a new one you wouldn't have to expose all the global state/functions that np.random exposes now. You could even restrict it to a single class and put that in the main numpy namespace. I'm not sure why you are talking about the global state and np.random.* convenience functions. What we do with those functions is out of scope for this NEP and would be talked about it another NEP fully introducing the new system. >> To properly use pseudorandom numbers, I need to instantiate a PRNG and thread it through all of the code in my program: both the parts that I write and the third party libraries that I don't write. >> >> Generating test data for unit tests is separable, though. That's why I propose having a StableRandom built on the new architecture. Its purpose would be well-documented, and in my proposal is limited in features such that it will be less likely to be abused outside of that purpose. If you make it fully-featured, it is more likely to be abused by building library code around it. But even if it is so abused, because it is built on the new architecture, at least I can thread the same core PRNG state through the StableRandom distributions from the abusing library and use the better distributions class elsewhere (randomgen names it "Generator"). Just keeping RandomState around can't work like that because it doesn't have a replaceable core PRNG. >> >> But that does suggest another alternative that we should explore: >> >> The new architecture separates the core uniform PRNG from the wide variety of non-uniform probability distributions. That is, the core PRNG state is encapsulated in a discrete object that can be shared between instances of different distribution-providing classes. numpy.random should provide two such distribution-providing classes. The main one (let us call it ``Generator``, as it is called in the prototype) will follow the new policy: distribution methods can break the stream in feature releases. There will also be a secondary distributions class (let us call it ``LegacyGenerator``) which contains distribution methods exactly as they exist in the current ``RandomState`` implementation. When one combines ``LegacyGenerator`` with the MT19937 core PRNG, it should reproduce the exact same stream as ``RandomState`` for all distribution methods. The ``LegacyGenerator`` methods will be forever frozen. ``numpy.random.RandomState()`` will instantiate a ``LegacyGenerator`` with the MT19937 core PRNG, and whatever tricks needed to make ``isinstance(prng, RandomState)`` and unpickling work should be done. This way of creating the ``LegacyGenerator`` by way of ``RandomState`` will be deprecated, becoming progressively noisier over a number of release cycles, in favor of explicitly instantiating ``LegacyGenerator``. 
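A minimal, self-contained sketch of that separation may help. All class names below are stand-ins for the proposal (they are not an existing numpy API), and the toy core reuses RandomState internally only so the sketch actually runs; the point is the composition: one core uniform PRNG object whose state is shared by two distribution-providing front ends.

    import numpy as np

    class CorePRNG:                        # stand-in for e.g. an MT19937 core object
        def __init__(self, seed):
            self._rs = np.random.RandomState(seed)
        def uniform(self, size):           # the only primitive the front ends consume
            return self._rs.random_sample(size)

    class LegacyGenerator:                 # frozen algorithms; the stream never changes
        def __init__(self, core):
            self.core = core
        def standard_exponential(self, size):
            return -np.log(1.0 - self.core.uniform(size))   # inversion method

    class Generator:                       # algorithms allowed to improve in feature releases
        def __init__(self, core):
            self.core = core
        def standard_exponential(self, size):
            return -np.log1p(-self.core.uniform(size))      # e.g. a more accurate variant

    core = CorePRNG(12345)                                # one state object...
    legacy, new = LegacyGenerator(core), Generator(core)  # ...two distribution layers
    print(legacy.standard_exponential(3), new.standard_exponential(3))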
>> >> ``LegacyGenerator`` CAN be used during this deprecation period in library and application code until libraries and applications can migrate to the new ``Generator``. Libraries and applications SHOULD migrate but MUST NOT be forced to. ``LegacyGenerator`` CAN be used to generate test data for unit tests where cross-release stability of the streams is important. Test writers SHOULD consider ways to mitigate their reliance on such stability and SHOULD limit their usage to distribution methods that have fewer cross-platform stability risks. I would appreciate your consideration of this proposal. Does it address your concerns? It addresses my concerns with keeping around a fully-functional RandomState implementation. -- Robert Kern -------------- next part -------------- An HTML attachment was scrubbed... URL: From josef.pktd at gmail.com Sun Jun 10 22:45:11 2018 From: josef.pktd at gmail.com (josef.pktd at gmail.com) Date: Sun, 10 Jun 2018 22:45:11 -0400 Subject: [Numpy-discussion] NEP: Random Number Generator Policy In-Reply-To: References: Message-ID: On Sun, Jun 10, 2018 at 9:08 PM, Robert Kern wrote: > On Sun, Jun 10, 2018 at 5:27 PM Ralf Gommers > wrote: > > > > On Mon, Jun 4, 2018 at 3:18 PM, Robert Kern > wrote: > >> > >> On Sun, Jun 3, 2018 at 8:22 PM Ralf Gommers > wrote: > >>> > >>> It may be worth having a look at test suites for scipy, statsmodels, > scikit-learn, etc. and estimate how much work this NEP causes those > projects. If the devs of those packages are forced to do large scale > migrations from RandomState to StableState, then why not instead keep > RandomState and just add a new API next to it? > >> > >> The problem is that we can't really have an ecosystem with two > different general purpose systems. > > > > Can't = prefer not to. > > I meant what I wrote. :-) > > > But yes, that's true. That's not what I was saying though. We want one > generic one, and one meant for unit testing only. You can achieve that in > two ways: > > 1. Change the current np.random API to new generic, and add a new > RandomStable for unit tests. > > 2. Add a new generic API, and document the current np.random API as > being meant for unit tests only, for other usage should be > preferred. > > > > (2) has a couple of pros: > > - you're not forcing almost every library and end user out there to > migrate their unit tests. > > But it has the cons that I talked about. RandomState *is* a fully > functional general purpose PRNG system. After all, that's its current use. > Documenting it as intended to be something else will not change that fact. > Documentation alone provides no real impetus to move to the new system > outside of the unit tests. And the community does need to move together to > the new system in their library code, or else we won't be able to combine > libraries together; these PRNG objects need to thread all the way through > between code from different authors if we are to write programs with a > controlled seed. The failure mode when people don't pay attention to the > documentation is that I can no longer write programs that compose these > libraries together. That's why I wrote "can't". It's not a mere preference > for not having two systems to maintain. It has binary Go/No Go implications > for building reproducible programs. > I don't understand this part. For example, scipy.stats and scikit-learn allow the user to provide a RandomState instance to the functions. I don't see why you want to force down stream libraries to change this. 
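(For reference, that pattern looks roughly like the sketch below; the helper is modeled on scikit-learn's check_random_state, and the function and argument names are only illustrative.)

    import numbers
    import numpy as np

    def check_random_state(seed):
        # Accept None, an int, or a RandomState and return a RandomState,
        # in the spirit of sklearn.utils.check_random_state.
        if seed is None:
            return np.random.mtrand._rand           # the hidden global instance
        if isinstance(seed, (numbers.Integral, np.integer)):
            return np.random.RandomState(seed)
        if isinstance(seed, np.random.RandomState):
            return seed
        raise ValueError("%r cannot be used to seed a RandomState" % seed)

    def bootstrap_ci(data, n_boot=1000, random_state=None):
        # library function: the caller controls reproducibility by passing
        # either an integer seed or an already-threaded RandomState
        rng = check_random_state(random_state)
        idx = rng.randint(0, len(data), size=(n_boot, len(data)))
        means = np.asarray(data)[idx].mean(axis=1)
        return np.percentile(means, [2.5, 97.5])

Callers that pass an int or a RandomState instance today keep working unchanged.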
A random state argument should be (essentially) compatible with whatever the user uses, and there is no reason to force packages to update there internal use like in unit tests if they don't want to, e.g. because of the instability. Aside to statsmodels: We currently have very few user facing random functions, those are just in maybe 3 to 5 places where we have simulated or bootstrap values. Most of the other uses of np.random are in unit tests and some in the documentation examples. Josef > > > - more design freedom for the new generic API. The current one is > clearly sub-optimal; in a new one you wouldn't have to expose all the > global state/functions that np.random exposes now. You could even restrict > it to a single class and put that in the main numpy namespace. > > I'm not sure why you are talking about the global state and np.random.* > convenience functions. What we do with those functions is out of scope for > this NEP and would be talked about it another NEP fully introducing the new > system. > > >> To properly use pseudorandom numbers, I need to instantiate a PRNG and > thread it through all of the code in my program: both the parts that I > write and the third party libraries that I don't write. > >> > >> Generating test data for unit tests is separable, though. That's why I > propose having a StableRandom built on the new architecture. Its purpose > would be well-documented, and in my proposal is limited in features such > that it will be less likely to be abused outside of that purpose. If you > make it fully-featured, it is more likely to be abused by building library > code around it. But even if it is so abused, because it is built on the new > architecture, at least I can thread the same core PRNG state through the > StableRandom distributions from the abusing library and use the better > distributions class elsewhere (randomgen names it "Generator"). Just > keeping RandomState around can't work like that because it doesn't have a > replaceable core PRNG. > >> > >> But that does suggest another alternative that we should explore: > >> > >> The new architecture separates the core uniform PRNG from the wide > variety of non-uniform probability distributions. That is, the core PRNG > state is encapsulated in a discrete object that can be shared between > instances of different distribution-providing classes. numpy.random should > provide two such distribution-providing classes. The main one (let us call > it ``Generator``, as it is called in the prototype) will follow the new > policy: distribution methods can break the stream in feature releases. > There will also be a secondary distributions class (let us call it > ``LegacyGenerator``) which contains distribution methods exactly as they > exist in the current ``RandomState`` implementation. When one combines > ``LegacyGenerator`` with the MT19937 core PRNG, it should reproduce the > exact same stream as ``RandomState`` for all distribution methods. The > ``LegacyGenerator`` methods will be forever frozen. > ``numpy.random.RandomState()`` will instantiate a ``LegacyGenerator`` with > the MT19937 core PRNG, and whatever tricks needed to make > ``isinstance(prng, RandomState)`` and unpickling work should be done. This > way of creating the ``LegacyGenerator`` by way of ``RandomState`` will be > deprecated, becoming progressively noisier over a number of release cycles, > in favor of explicitly instantiating ``LegacyGenerator``. 
> >> > >> ``LegacyGenerator`` CAN be used during this deprecation period in > library and application code until libraries and applications can migrate > to the new ``Generator``. Libraries and applications SHOULD migrate but > MUST NOT be forced to. ``LegacyGenerator`` CAN be used to generate test > data for unit tests where cross-release stability of the streams is > important. Test writers SHOULD consider ways to mitigate their reliance on > such stability and SHOULD limit their usage to distribution methods that > have fewer cross-platform stability risks. > > I would appreciate your consideration of this proposal. Does it address > your concerns? It addresses my concerns with keeping around a > fully-functional RandomState implementation. > > -- > Robert Kern > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at python.org > https://mail.python.org/mailman/listinfo/numpy-discussion > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From ralf.gommers at gmail.com Sun Jun 10 23:01:20 2018 From: ralf.gommers at gmail.com (Ralf Gommers) Date: Sun, 10 Jun 2018 20:01:20 -0700 Subject: [Numpy-discussion] NEP: Random Number Generator Policy In-Reply-To: References: Message-ID: On Sun, Jun 10, 2018 at 6:08 PM, Robert Kern wrote: > On Sun, Jun 10, 2018 at 5:27 PM Ralf Gommers > wrote: > > > > On Mon, Jun 4, 2018 at 3:18 PM, Robert Kern > wrote: > >> > >> On Sun, Jun 3, 2018 at 8:22 PM Ralf Gommers > wrote: > >>> > >>> It may be worth having a look at test suites for scipy, statsmodels, > scikit-learn, etc. and estimate how much work this NEP causes those > projects. If the devs of those packages are forced to do large scale > migrations from RandomState to StableState, then why not instead keep > RandomState and just add a new API next to it? > >> > >> The problem is that we can't really have an ecosystem with two > different general purpose systems. > > > > Can't = prefer not to. > > I meant what I wrote. :-) > > > But yes, that's true. That's not what I was saying though. We want one > generic one, and one meant for unit testing only. You can achieve that in > two ways: > > 1. Change the current np.random API to new generic, and add a new > RandomStable for unit tests. > > 2. Add a new generic API, and document the current np.random API as > being meant for unit tests only, for other usage should be > preferred. > > > > (2) has a couple of pros: > > - you're not forcing almost every library and end user out there to > migrate their unit tests. > > But it has the cons that I talked about. RandomState *is* a fully > functional general purpose PRNG system. After all, that's its current use. > Documenting it as intended to be something else will not change that fact. > Documentation alone provides no real impetus to move to the new system > outside of the unit tests. And the community does need to move together to > the new system in their library code, or else we won't be able to combine > libraries together; these PRNG objects need to thread all the way through > between code from different authors if we are to write programs with a > controlled seed. The failure mode when people don't pay attention to the > documentation is that I can no longer write programs that compose these > libraries together. That's why I wrote "can't". It's not a mere preference > for not having two systems to maintain. It has binary Go/No Go implications > for building reproducible programs. 
> I strongly suspect you are right, but only because you're asserting "can't" so heavily. I have trouble formulating what would go wrong in case there's two PRNGs used in a single program. It's not described in the NEP, nor in the numpy.random docs (those don't even have any recommendations for best practices listed as far as I can tell - that needs fixing). All you explain in the NEP is that reproducible research isn't helped by the current stream-compat guarantee. So a bit of (probably incorrect) devil's advocate reasoning: - If there's no stream-compat guarantee, all a user can rely on is the properties of drawing from a seeded PRNG. - Any use of a PRNG in library code can also only rely on properties - So now whether in a user's program libraries draw from one or two seeded PRNGs doesn't matter for reproducibility, because those properties don't change. Also, if there is to be a multi-year transitioning to the new API, would there be two PRNG systems anyway during those years? > > - more design freedom for the new generic API. The current one is > clearly sub-optimal; in a new one you wouldn't have to expose all the > global state/functions that np.random exposes now. You could even restrict > it to a single class and put that in the main numpy namespace. > > I'm not sure why you are talking about the global state and np.random.* > convenience functions. What we do with those functions is out of scope for > this NEP and would be talked about it another NEP fully introducing the new > system. > To quote you from one of the first emails in this thread: " I deliberately left it out of this one as it may, depending on our choices, impinge upon the design of the new PRNG subsystem, which I declared out of scope for this NEP. I have ideas (besides the glib "Let them eat AttributeErrors!"), and now that I think more about it, that does seem like it might be in scope just like the discussion of freezing RandomState and StableRandom are. But I think I'd like to hold that thought a little bit and get a little more screaming^Wfeedback on the core proposal first. I'll return to this in a few days if not sooner. " So consider this some screaming^Wfeedback:) > > >> To properly use pseudorandom numbers, I need to instantiate a PRNG and > thread it through all of the code in my program: both the parts that I > write and the third party libraries that I don't write. > >> > >> Generating test data for unit tests is separable, though. That's why I > propose having a StableRandom built on the new architecture. Its purpose > would be well-documented, and in my proposal is limited in features such > that it will be less likely to be abused outside of that purpose. If you > make it fully-featured, it is more likely to be abused by building library > code around it. But even if it is so abused, because it is built on the new > architecture, at least I can thread the same core PRNG state through the > StableRandom distributions from the abusing library and use the better > distributions class elsewhere (randomgen names it "Generator"). Just > keeping RandomState around can't work like that because it doesn't have a > replaceable core PRNG. > >> > >> But that does suggest another alternative that we should explore: > >> > >> The new architecture separates the core uniform PRNG from the wide > variety of non-uniform probability distributions. That is, the core PRNG > state is encapsulated in a discrete object that can be shared between > instances of different distribution-providing classes. 
numpy.random should > provide two such distribution-providing classes. The main one (let us call > it ``Generator``, as it is called in the prototype) will follow the new > policy: distribution methods can break the stream in feature releases. > There will also be a secondary distributions class (let us call it > ``LegacyGenerator``) which contains distribution methods exactly as they > exist in the current ``RandomState`` implementation. When one combines > ``LegacyGenerator`` with the MT19937 core PRNG, it should reproduce the > exact same stream as ``RandomState`` for all distribution methods. The > ``LegacyGenerator`` methods will be forever frozen. > ``numpy.random.RandomState()`` will instantiate a ``LegacyGenerator`` with > the MT19937 core PRNG, and whatever tricks needed to make > ``isinstance(prng, RandomState)`` and unpickling work should be done. This > way of creating the ``LegacyGenerator`` by way of ``RandomState`` will be > deprecated, becoming progressively noisier over a number of release cycles, > in favor of explicitly instantiating ``LegacyGenerator``. > >> > >> ``LegacyGenerator`` CAN be used during this deprecation period in > library and application code until libraries and applications can migrate > to the new ``Generator``. Libraries and applications SHOULD migrate but > MUST NOT be forced to. ``LegacyGenerator`` CAN be used to generate test > data for unit tests where cross-release stability of the streams is > important. Test writers SHOULD consider ways to mitigate their reliance on > such stability and SHOULD limit their usage to distribution methods that > have fewer cross-platform stability risks. > > I would appreciate your consideration of this proposal. Does it address > your concerns? It addresses my concerns with keeping around a > fully-functional RandomState implementation. > My concerns are: 1. The amount of work caused by making libraries and end users migrate. 2. That this is a backwards compatibility break, which will cause problems for users who relied on the old guarantees (the arguments in the NEP that the old guarantees weren't 100% watertight don't mean that backcompat doesn't matter at all). As far as I can tell, this new proposal doesn't deal with those concerns directly. What it does seem to do is making transitioning a bit easier for users that were already using RandomState instances. Cheers, Ralf -------------- next part -------------- An HTML attachment was scrubbed... URL: From ralf.gommers at gmail.com Sun Jun 10 23:10:16 2018 From: ralf.gommers at gmail.com (Ralf Gommers) Date: Sun, 10 Jun 2018 20:10:16 -0700 Subject: [Numpy-discussion] NEP: Random Number Generator Policy In-Reply-To: References: Message-ID: On Sun, Jun 10, 2018 at 5:57 PM, Robert Kern wrote: > On Sun, Jun 10, 2018 at 5:47 PM Ralf Gommers > wrote: > > > > On Sun, Jun 3, 2018 at 9:23 PM, Warren Weckesser < > warren.weckesser at gmail.com> wrote: > > >> I suspect many of the tests will be easy to update, so fixing 300 or so > tests does not seem like a monumental task. > > > > It's all not monumental, but it adds up quickly. In addition to changing > tests, one will also need compatibility code when supporting multiple numpy > versions (e.g. scipy when get a copy of RandomStable in > scipy/_lib/_numpy_compat.py). > > > > A quick count of just np.random.seed occurrences with ``$ grep -roh > --include \*.py np.random.seed . 
| wc -w`` for some packages: > > numpy: 77 > > scipy: 462 > > matplotlib: 204 > > statsmodels: 461 > > pymc3: 36 > > scikit-image: 63 > > scikit-learn: 69 > > keras: 46 > > pytorch: 0 > > tensorflow: 368 > > astropy: 24 > > > > And note, these are *not* incorrect/broken usages, this is code that > works and has done so for years. > > Yes, some of them are incorrect and broken. Failure can be difficult to > detect. This module from keras is particularly problematic: > > https://github.com/keras-team/keras-preprocessing/blob/ > master/keras_preprocessing/image.py > You have to appreciate that we're not all thinking at lightning speed and in the same direction. If there is a difficult to detect problem, it may be useful to give a brief code example (or even line of reasoning) of how this actually breaks something. > > > Conclusion: the current proposal will cause work for the vast majority > of libraries that depends on numpy. The total amount of that work will > certainly not be counted in person-days/weeks, and more likely in years > than months. So I'm not convinced yet that the current proposal is the best > way forward. > > The mere usage of np.random.seed() doesn't imply that these packages > actually require stream-compatibility. Some might, for sure, like where > they are used in the unit tests, but that's not what you counted. At best, > these numbers just mean that we can't eliminate np.random.seed() in a new > system right away. > Well, mere usage has been called an antipattern (also on your behalf), plus for scipy over half of the usages do give test failures (Warren's quick test). So I'd say that counting usages is a decent proxy for the work that has to be done. Cheers, Ralf -------------- next part -------------- An HTML attachment was scrubbed... URL: From shoyer at gmail.com Sun Jun 10 23:38:50 2018 From: shoyer at gmail.com (Stephan Hoyer) Date: Sun, 10 Jun 2018 20:38:50 -0700 Subject: [Numpy-discussion] NEP: Random Number Generator Policy In-Reply-To: References: Message-ID: On Sun, Jun 10, 2018 at 8:10 PM Ralf Gommers wrote: > On Sun, Jun 10, 2018 at 5:57 PM, Robert Kern > wrote: > >> > Conclusion: the current proposal will cause work for the vast majority >> of libraries that depends on numpy. The total amount of that work will >> certainly not be counted in person-days/weeks, and more likely in years >> than months. So I'm not convinced yet that the current proposal is the best >> way forward. >> > >> The mere usage of np.random.seed() doesn't imply that these packages >> actually require stream-compatibility. Some might, for sure, like where >> they are used in the unit tests, but that's not what you counted. At best, >> these numbers just mean that we can't eliminate np.random.seed() in a new >> system right away. >> > > Well, mere usage has been called an antipattern (also on your behalf), > plus for scipy over half of the usages do give test failures (Warren's > quick test). So I'd say that counting usages is a decent proxy for the work > that has to be done. > Let me suggest another possible concession for backwards compatibility. We should make a dedicated module, e.g., "numpy.random.stable" that contains functions implemented as methods on StableRandom. These functions should include "seed", which is too pervasive to justify removing. Transitioning to the new module should be as simple as mechanistically replacing all uses of "numpy.random" with "numpy.random.stable". 
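A sketch of how thin such a module could be (the module path and the StableRandom name are from the proposal, not an existing numpy API; RandomState stands in for StableRandom here only so the sketch is concrete):

    # hypothetical numpy/random/stable.py
    import numpy as np

    _stable = np.random.RandomState()    # stand-in for the proposed StableRandom()

    # re-export the instance's methods as module-level functions, mirroring
    # how numpy.random exposes its hidden global RandomState today
    seed = _stable.seed
    random_sample = _stable.random_sample
    randint = _stable.randint
    standard_normal = _stable.standard_normal
    shuffle = _stable.shuffle
    # ... one line per supported method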
This module would add virtually no maintenance overhead, because the implementations would be entirely contained on StableRandom, and would simply involve creating a single top-level StableRandom object (like what is currently done in numpy.random). -------------- next part -------------- An HTML attachment was scrubbed... URL: From robert.kern at gmail.com Mon Jun 11 01:06:11 2018 From: robert.kern at gmail.com (Robert Kern) Date: Sun, 10 Jun 2018 22:06:11 -0700 Subject: [Numpy-discussion] NEP: Random Number Generator Policy In-Reply-To: References: Message-ID: On Sun, Jun 10, 2018 at 7:46 PM wrote: > > On Sun, Jun 10, 2018 at 9:08 PM, Robert Kern wrote: >> >> On Sun, Jun 10, 2018 at 5:27 PM Ralf Gommers wrote: >> > >> > On Mon, Jun 4, 2018 at 3:18 PM, Robert Kern wrote: >> >> >> >> On Sun, Jun 3, 2018 at 8:22 PM Ralf Gommers wrote: >> >>> >> >>> It may be worth having a look at test suites for scipy, statsmodels, scikit-learn, etc. and estimate how much work this NEP causes those projects. If the devs of those packages are forced to do large scale migrations from RandomState to StableState, then why not instead keep RandomState and just add a new API next to it? >> >> >> >> The problem is that we can't really have an ecosystem with two different general purpose systems. >> > >> > Can't = prefer not to. >> >> I meant what I wrote. :-) >> >> > But yes, that's true. That's not what I was saying though. We want one generic one, and one meant for unit testing only. You can achieve that in two ways: >> > 1. Change the current np.random API to new generic, and add a new RandomStable for unit tests. >> > 2. Add a new generic API, and document the current np.random API as being meant for unit tests only, for other usage should be preferred. >> > >> > (2) has a couple of pros: >> > - you're not forcing almost every library and end user out there to migrate their unit tests. >> >> But it has the cons that I talked about. RandomState *is* a fully functional general purpose PRNG system. After all, that's its current use. Documenting it as intended to be something else will not change that fact. Documentation alone provides no real impetus to move to the new system outside of the unit tests. And the community does need to move together to the new system in their library code, or else we won't be able to combine libraries together; these PRNG objects need to thread all the way through between code from different authors if we are to write programs with a controlled seed. The failure mode when people don't pay attention to the documentation is that I can no longer write programs that compose these libraries together. That's why I wrote "can't". It's not a mere preference for not having two systems to maintain. It has binary Go/No Go implications for building reproducible programs. > > I don't understand this part. > For example, scipy.stats and scikit-learn allow the user to provide a RandomState instance to the functions. I don't see why you want to force down stream libraries to change this. A random state argument should be (essentially) compatible with whatever the user uses, and there is no reason to force packages to update there internal use like in unit tests if they don't want to, e.g. because of the instability. > > Aside to statsmodels: We currently have very few user facing random functions, those are just in maybe 3 to 5 places where we have simulated or bootstrap values. > Most of the other uses of np.random are in unit tests and some in the documentation examples. 
Please consider my alternative proposal. Your feedback has convinced me that that's a better approach than the StableRandom as laid out in the NEP. I'm even willing to not deprecate the name RandomState. -- Robert Kern -------------- next part -------------- An HTML attachment was scrubbed... URL: From robert.kern at gmail.com Mon Jun 11 01:36:29 2018 From: robert.kern at gmail.com (Robert Kern) Date: Sun, 10 Jun 2018 22:36:29 -0700 Subject: [Numpy-discussion] NEP: Random Number Generator Policy In-Reply-To: References: Message-ID: On Sun, Jun 10, 2018 at 8:04 PM Ralf Gommers wrote: > > On Sun, Jun 10, 2018 at 6:08 PM, Robert Kern wrote: >> >> On Sun, Jun 10, 2018 at 5:27 PM Ralf Gommers wrote: >> > >> > On Mon, Jun 4, 2018 at 3:18 PM, Robert Kern wrote: >> >> >> >> On Sun, Jun 3, 2018 at 8:22 PM Ralf Gommers wrote: >> >>> >> >>> It may be worth having a look at test suites for scipy, statsmodels, scikit-learn, etc. and estimate how much work this NEP causes those projects. If the devs of those packages are forced to do large scale migrations from RandomState to StableState, then why not instead keep RandomState and just add a new API next to it? >> >> >> >> The problem is that we can't really have an ecosystem with two different general purpose systems. >> > >> > Can't = prefer not to. >> >> I meant what I wrote. :-) >> >> > But yes, that's true. That's not what I was saying though. We want one generic one, and one meant for unit testing only. You can achieve that in two ways: >> > 1. Change the current np.random API to new generic, and add a new RandomStable for unit tests. >> > 2. Add a new generic API, and document the current np.random API as being meant for unit tests only, for other usage should be preferred. >> > >> > (2) has a couple of pros: >> > - you're not forcing almost every library and end user out there to migrate their unit tests. >> >> But it has the cons that I talked about. RandomState *is* a fully functional general purpose PRNG system. After all, that's its current use. Documenting it as intended to be something else will not change that fact. Documentation alone provides no real impetus to move to the new system outside of the unit tests. And the community does need to move together to the new system in their library code, or else we won't be able to combine libraries together; these PRNG objects need to thread all the way through between code from different authors if we are to write programs with a controlled seed. The failure mode when people don't pay attention to the documentation is that I can no longer write programs that compose these libraries together. That's why I wrote "can't". It's not a mere preference for not having two systems to maintain. It has binary Go/No Go implications for building reproducible programs. > > I strongly suspect you are right, but only because you're asserting "can't" so heavily. I have trouble formulating what would go wrong in case there's two PRNGs used in a single program. It's not described in the NEP, nor in the numpy.random docs (those don't even have any recommendations for best practices listed as far as I can tell - that needs fixing). All you explain in the NEP is that reproducible research isn't helped by the current stream-compat guarantee. So a bit of (probably incorrect) devil's advocate reasoning: > - If there's no stream-compat guarantee, all a user can rely on is the properties of drawing from a seeded PRNG. 
> - Any use of a PRNG in library code can also only rely on properties > - So now whether in a user's program libraries draw from one or two seeded PRNGs doesn't matter for reproducibility, because those properties don't change. Correctly making a stochastic program reproducible while retaining good statistical properties is difficult. People don't do it well in the best of circumstances. The best way that we've found to manage that difficulty is to instantiate a single stream and use it all throughout your code. Every new stream requires the management of more seeds (unless if we use the fancy new algorithms that have settable stream IDs, but by stipulation, we don't have these in this case). And now I have to thread both of these objects through my code, and pass the right object to each third-party library. These third-party libraries don't know anything about this weird 2-stream workaround that you are doing, so we now have libraries that can't build on each other unless if they are using the same compatible API, even if I can make workarounds to build a program that combines two libraries side-to-side. So yeah, people "can" do this. "It's just a matter of code" as my boss likes to say. But it's making an already-difficult task more difficult. > Also, if there is to be a multi-year transitioning to the new API, would there be two PRNG systems anyway during those years? Sure, but with a deadline and not-just-documentation to motivate transitioning. But if we follow my alternative proposal, there'll be no need for deprecation! You've convinced me to not deprecate RandomState. I just want to change some of its internal implementation details, add a less-stable set of distributions on the side, and a framework of core uniform PRNGs that can be shared by both. >> > - more design freedom for the new generic API. The current one is clearly sub-optimal; in a new one you wouldn't have to expose all the global state/functions that np.random exposes now. You could even restrict it to a single class and put that in the main numpy namespace. >> >> I'm not sure why you are talking about the global state and np.random.* convenience functions. What we do with those functions is out of scope for this NEP and would be talked about it another NEP fully introducing the new system. > > To quote you from one of the first emails in this thread: " > I deliberately left it out of this one as it may, depending on our choices, impinge upon the design of the new PRNG subsystem, which I declared out of scope for this NEP. I have ideas (besides the glib "Let them eat AttributeErrors!"), and now that I think more about it, that does seem like it might be in scope just like the discussion of freezing RandomState and StableRandom are. But I think I'd like to hold that thought a little bit and get a little more screaming^Wfeedback on the core proposal first. I'll return to this in a few days if not sooner. > " > > So consider this some screaming^Wfeedback:) Ahem. Yes, I just remembered I said that. :-) But still, there will be lots of options about what to do with np.random.*, whatever proposal we go with. It doesn't really impose constraints on the core proposals. >> >> To properly use pseudorandom numbers, I need to instantiate a PRNG and thread it through all of the code in my program: both the parts that I write and the third party libraries that I don't write. >> >> >> >> Generating test data for unit tests is separable, though. That's why I propose having a StableRandom built on the new architecture. 
Its purpose would be well-documented, and in my proposal is limited in features such that it will be less likely to be abused outside of that purpose. If you make it fully-featured, it is more likely to be abused by building library code around it. But even if it is so abused, because it is built on the new architecture, at least I can thread the same core PRNG state through the StableRandom distributions from the abusing library and use the better distributions class elsewhere (randomgen names it "Generator"). Just keeping RandomState around can't work like that because it doesn't have a replaceable core PRNG. >> >> >> >> But that does suggest another alternative that we should explore: >> >> >> >> The new architecture separates the core uniform PRNG from the wide variety of non-uniform probability distributions. That is, the core PRNG state is encapsulated in a discrete object that can be shared between instances of different distribution-providing classes. numpy.random should provide two such distribution-providing classes. The main one (let us call it ``Generator``, as it is called in the prototype) will follow the new policy: distribution methods can break the stream in feature releases. There will also be a secondary distributions class (let us call it ``LegacyGenerator``) which contains distribution methods exactly as they exist in the current ``RandomState`` implementation. When one combines ``LegacyGenerator`` with the MT19937 core PRNG, it should reproduce the exact same stream as ``RandomState`` for all distribution methods. The ``LegacyGenerator`` methods will be forever frozen. ``numpy.random.RandomState()`` will instantiate a ``LegacyGenerator`` with the MT19937 core PRNG, and whatever tricks needed to make ``isinstance(prng, RandomState)`` and unpickling work should be done. This way of creating the ``LegacyGenerator`` by way of ``RandomState`` will be deprecated, becoming progressively noisier over a number of release cycles, in favor of explicitly instantiating ``LegacyGenerator``. >> >> >> >> ``LegacyGenerator`` CAN be used during this deprecation period in library and application code until libraries and applications can migrate to the new ``Generator``. Libraries and applications SHOULD migrate but MUST NOT be forced to. ``LegacyGenerator`` CAN be used to generate test data for unit tests where cross-release stability of the streams is important. Test writers SHOULD consider ways to mitigate their reliance on such stability and SHOULD limit their usage to distribution methods that have fewer cross-platform stability risks. >> >> I would appreciate your consideration of this proposal. Does it address your concerns? It addresses my concerns with keeping around a fully-functional RandomState implementation. > > My concerns are: > 1. The amount of work caused by making libraries and end users migrate. > 2. That this is a backwards compatibility break, which will cause problems for users who relied on the old guarantees (the arguments in the NEP that the old guarantees weren't 100% watertight don't mean that backcompat doesn't matter at all). > > As far as I can tell, this new proposal doesn't deal with those concerns directly. What it does seem to do is making transitioning a bit easier for users that were already using RandomState instances. Let me drop the deprecation of the name RandomState. RandomState(int_seed) will forever and always create a backwards- and stream-compatible object. No one will have to migrate. How does that strike you? 
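(For concreteness, this is the kind of existing code that keeps working unchanged under that commitment, and it is also the single-stream pattern argued for above: one seeded RandomState threaded through the whole program. The function names are only illustrative.)

    import numpy as np

    prng = np.random.RandomState(12345)      # the one stream for the whole program

    def simulate_path(n, random_state):
        return random_state.standard_normal(n).cumsum()

    # the same object is handed to everything that needs randomness,
    # including third-party code that accepts a random_state argument
    path = simulate_path(1000, prng)
    noise = prng.poisson(5.0, size=3)
    print(path[-1], noise)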
-- Robert Kern -------------- next part -------------- An HTML attachment was scrubbed... URL: From robert.kern at gmail.com Mon Jun 11 02:15:44 2018 From: robert.kern at gmail.com (Robert Kern) Date: Sun, 10 Jun 2018 23:15:44 -0700 Subject: [Numpy-discussion] NEP: Random Number Generator Policy In-Reply-To: References: Message-ID: On Sun, Jun 10, 2018 at 8:11 PM Ralf Gommers wrote: > > On Sun, Jun 10, 2018 at 5:57 PM, Robert Kern wrote: >> >> On Sun, Jun 10, 2018 at 5:47 PM Ralf Gommers wrote: >> > >> > On Sun, Jun 3, 2018 at 9:23 PM, Warren Weckesser < warren.weckesser at gmail.com> wrote: >> >> >> I suspect many of the tests will be easy to update, so fixing 300 or so tests does not seem like a monumental task. >> > >> > It's all not monumental, but it adds up quickly. In addition to changing tests, one will also need compatibility code when supporting multiple numpy versions (e.g. scipy when get a copy of RandomStable in scipy/_lib/_numpy_compat.py). >> > >> > A quick count of just np.random.seed occurrences with ``$ grep -roh --include \*.py np.random.seed . | wc -w`` for some packages: >> > numpy: 77 >> > scipy: 462 >> > matplotlib: 204 >> > statsmodels: 461 >> > pymc3: 36 >> > scikit-image: 63 >> > scikit-learn: 69 >> > keras: 46 >> > pytorch: 0 >> > tensorflow: 368 >> > astropy: 24 >> > >> > And note, these are *not* incorrect/broken usages, this is code that works and has done so for years. >> >> Yes, some of them are incorrect and broken. Failure can be difficult to detect. This module from keras is particularly problematic: >> >> https://github.com/keras-team/keras-preprocessing/blob/master/keras_preprocessing/image.py > > You have to appreciate that we're not all thinking at lightning speed and in the same direction. If there is a difficult to detect problem, it may be useful to give a brief code example (or even line of reasoning) of how this actually breaks something. Ahem. Sorry. That wasn't the code I was thinking of. It's merely hazardous, not broken by itself. However, if you used any of the `seed=` arguments that are helpfully(?) provided, you are almost certainly writing broken code. If you must use np.random.seed() to get reproducibility, you need to call it exactly once at the start of your code (or maybe once for each process) and let it ride. This is the impossible-to-use-correctly code that I was thinking of, which got partially fixed after I pointed out the problem. https://github.com/keras-team/keras/pull/8325/files The intention of this code is to shuffle two same-length sequences in the same way. So now if I write my code well to call np.random.seed() once at the start of my program, this function comes along and obliterates that with a fixed seed just so it can reuse the seed again to replicate the shuffle. Puzzlingly, the root sin of unconditionally and unavoidably reseeding for some of these functions is still there even though I showed how and why to avoid it. This is one reason why I was skeptical that merely documenting RandomState or StableRandom to only be used for unit tests would work. :-) >> > Conclusion: the current proposal will cause work for the vast majority of libraries that depends on numpy. The total amount of that work will certainly not be counted in person-days/weeks, and more likely in years than months. So I'm not convinced yet that the current proposal is the best way forward. >> >> The mere usage of np.random.seed() doesn't imply that these packages actually require stream-compatibility. 
Some might, for sure, like where they are used in the unit tests, but that's not what you counted. At best, these numbers just mean that we can't eliminate np.random.seed() in a new system right away. > > Well, mere usage has been called an antipattern (also on your behalf), plus for scipy over half of the usages do give test failures (Warren's quick test). So I'd say that counting usages is a decent proxy for the work that has to be done. Sure. But with my new proposal, we don't have to change it (as much as I'd like to!). I'll draft up a PR to modify my NEP accordingly. -- Robert Kern -------------- next part -------------- An HTML attachment was scrubbed... URL: From ralf.gommers at gmail.com Mon Jun 11 02:43:33 2018 From: ralf.gommers at gmail.com (Ralf Gommers) Date: Sun, 10 Jun 2018 23:43:33 -0700 Subject: [Numpy-discussion] NEP: Random Number Generator Policy In-Reply-To: References: Message-ID: On Sun, Jun 10, 2018 at 10:36 PM, Robert Kern wrote: > On Sun, Jun 10, 2018 at 8:04 PM Ralf Gommers > wrote: > > > > On Sun, Jun 10, 2018 at 6:08 PM, Robert Kern > wrote: > >> > >> On Sun, Jun 10, 2018 at 5:27 PM Ralf Gommers > wrote: > >> > > >> > On Mon, Jun 4, 2018 at 3:18 PM, Robert Kern > wrote: > >> >> > >> >> On Sun, Jun 3, 2018 at 8:22 PM Ralf Gommers > wrote: > >> >>> > >> >>> It may be worth having a look at test suites for scipy, > statsmodels, scikit-learn, etc. and estimate how much work this NEP causes > those projects. If the devs of those packages are forced to do large scale > migrations from RandomState to StableState, then why not instead keep > RandomState and just add a new API next to it? > >> >> > >> >> The problem is that we can't really have an ecosystem with two > different general purpose systems. > >> > > >> > Can't = prefer not to. > >> > >> I meant what I wrote. :-) > >> > >> > But yes, that's true. That's not what I was saying though. We want > one generic one, and one meant for unit testing only. You can achieve that > in two ways: > >> > 1. Change the current np.random API to new generic, and add a new > RandomStable for unit tests. > >> > 2. Add a new generic API, and document the current np.random API as > being meant for unit tests only, for other usage should be > preferred. > >> > > >> > (2) has a couple of pros: > >> > - you're not forcing almost every library and end user out there to > migrate their unit tests. > >> > >> But it has the cons that I talked about. RandomState *is* a fully > functional general purpose PRNG system. After all, that's its current use. > Documenting it as intended to be something else will not change that fact. > Documentation alone provides no real impetus to move to the new system > outside of the unit tests. And the community does need to move together to > the new system in their library code, or else we won't be able to combine > libraries together; these PRNG objects need to thread all the way through > between code from different authors if we are to write programs with a > controlled seed. The failure mode when people don't pay attention to the > documentation is that I can no longer write programs that compose these > libraries together. That's why I wrote "can't". It's not a mere preference > for not having two systems to maintain. It has binary Go/No Go implications > for building reproducible programs. > > > > I strongly suspect you are right, but only because you're asserting > "can't" so heavily. I have trouble formulating what would go wrong in case > there's two PRNGs used in a single program. 
It's not described in the NEP, > nor in the numpy.random docs (those don't even have any recommendations for > best practices listed as far as I can tell - that needs fixing). All you > explain in the NEP is that reproducible research isn't helped by the > current stream-compat guarantee. So a bit of (probably incorrect) devil's > advocate reasoning: > > - If there's no stream-compat guarantee, all a user can rely on is the > properties of drawing from a seeded PRNG. > > - Any use of a PRNG in library code can also only rely on properties > > - So now whether in a user's program libraries draw from one or two > seeded PRNGs doesn't matter for reproducibility, because those properties > don't change. > > Correctly making a stochastic program reproducible while retaining good > statistical properties is difficult. People don't do it well in the best of > circumstances. The best way that we've found to manage that difficulty is > to instantiate a single stream and use it all throughout your code. Every > new stream requires the management of more seeds (unless if we use the > fancy new algorithms that have settable stream IDs, but by stipulation, we > don't have these in this case). And now I have to thread both of these > objects through my code, and pass the right object to each third-party > library. These third-party libraries don't know anything about this weird > 2-stream workaround that you are doing, so we now have libraries that can't > build on each other unless if they are using the same compatible API, even > if I can make workarounds to build a program that combines two libraries > side-to-side. > > So yeah, people "can" do this. "It's just a matter of code" as my boss > likes to say. But it's making an already-difficult task more difficult. > Okay, that makes more sense to me now. It would be really useful to document such best practices and rationales. Note that scipy.stats distributions allow passing in either a RandomState instance or an integer as seed (which will be used for seeding a new instance, not for np.random.seed) [1]. That seems like a fine design pattern as well, and passing on a seed that way is fairly easy and as good for reproducibility as passing in a single PRNG. [1] https://github.com/scipy/scipy/blob/master/scipy/stats/_distn_infrastructure.py#L612 > > Also, if there is to be a multi-year transitioning to the new API, would > there be two PRNG systems anyway during those years? > > Sure, but with a deadline and not-just-documentation to motivate > transitioning. > > But if we follow my alternative proposal, there'll be no need for > deprecation! You've convinced me to not deprecate RandomState. > That's not how I had read it, but great to hear that! I just want to change some of its internal implementation details, add a > less-stable set of distributions on the side, and a framework of core > uniform PRNGs that can be shared by both. > > >> > - more design freedom for the new generic API. The current one is > clearly sub-optimal; in a new one you wouldn't have to expose all the > global state/functions that np.random exposes now. You could even restrict > it to a single class and put that in the main numpy namespace. > >> > >> I'm not sure why you are talking about the global state and np.random.* > convenience functions. What we do with those functions is out of scope for > this NEP and would be talked about it another NEP fully introducing the new > system. 
> > > > To quote you from one of the first emails in this thread: " > > I deliberately left it out of this one as it may, depending on our > choices, impinge upon the design of the new PRNG subsystem, which I > declared out of scope for this NEP. I have ideas (besides the glib "Let > them eat AttributeErrors!"), and now that I think more about it, that does > seem like it might be in scope just like the discussion of freezing > RandomState and StableRandom are. But I think I'd like to hold that thought > a little bit and get a little more screaming^Wfeedback on the core proposal > first. I'll return to this in a few days if not sooner. > > " > > > > So consider this some screaming^Wfeedback:) > > Ahem. Yes, I just remembered I said that. :-) But still, there will be > lots of options about what to do with np.random.*, whatever proposal we go > with. It doesn't really impose constraints on the core proposals. > > >> >> To properly use pseudorandom numbers, I need to instantiate a PRNG > and thread it through all of the code in my program: both the parts that I > write and the third party libraries that I don't write. > >> >> > >> >> Generating test data for unit tests is separable, though. That's why > I propose having a StableRandom built on the new architecture. Its purpose > would be well-documented, and in my proposal is limited in features such > that it will be less likely to be abused outside of that purpose. If you > make it fully-featured, it is more likely to be abused by building library > code around it. But even if it is so abused, because it is built on the new > architecture, at least I can thread the same core PRNG state through the > StableRandom distributions from the abusing library and use the better > distributions class elsewhere (randomgen names it "Generator"). Just > keeping RandomState around can't work like that because it doesn't have a > replaceable core PRNG. > >> >> > >> >> But that does suggest another alternative that we should explore: > >> >> > >> >> The new architecture separates the core uniform PRNG from the wide > variety of non-uniform probability distributions. That is, the core PRNG > state is encapsulated in a discrete object that can be shared between > instances of different distribution-providing classes. numpy.random should > provide two such distribution-providing classes. The main one (let us call > it ``Generator``, as it is called in the prototype) will follow the new > policy: distribution methods can break the stream in feature releases. > There will also be a secondary distributions class (let us call it > ``LegacyGenerator``) which contains distribution methods exactly as they > exist in the current ``RandomState`` implementation. When one combines > ``LegacyGenerator`` with the MT19937 core PRNG, it should reproduce the > exact same stream as ``RandomState`` for all distribution methods. The > ``LegacyGenerator`` methods will be forever frozen. > ``numpy.random.RandomState()`` will instantiate a ``LegacyGenerator`` with > the MT19937 core PRNG, and whatever tricks needed to make > ``isinstance(prng, RandomState)`` and unpickling work should be done. This > way of creating the ``LegacyGenerator`` by way of ``RandomState`` will be > deprecated, becoming progressively noisier over a number of release cycles, > in favor of explicitly instantiating ``LegacyGenerator``. 
> >> >> > >> >> ``LegacyGenerator`` CAN be used during this deprecation period in > library and application code until libraries and applications can migrate > to the new ``Generator``. Libraries and applications SHOULD migrate but > MUST NOT be forced to. ``LegacyGenerator`` CAN be used to generate test > data for unit tests where cross-release stability of the streams is > important. Test writers SHOULD consider ways to mitigate their reliance on > such stability and SHOULD limit their usage to distribution methods that > have fewer cross-platform stability risks. > >> > >> I would appreciate your consideration of this proposal. Does it address > your concerns? It addresses my concerns with keeping around a > fully-functional RandomState implementation. > > > > My concerns are: > > 1. The amount of work caused by making libraries and end users migrate. > > 2. That this is a backwards compatibility break, which will cause > problems for users who relied on the old guarantees (the arguments in the > NEP that the old guarantees weren't 100% watertight don't mean that > backcompat doesn't matter at all). > > > > As far as I can tell, this new proposal doesn't deal with those concerns > directly. What it does seem to do is making transitioning a bit easier for > users that were already using RandomState instances. > > Let me drop the deprecation of the name RandomState. RandomState(int_seed) > will forever and always create a backwards- and stream-compatible object. > No one will have to migrate. > > How does that strike you? > Sounds good. Cheers, Ralf -------------- next part -------------- An HTML attachment was scrubbed... URL: From ralf.gommers at gmail.com Mon Jun 11 02:53:07 2018 From: ralf.gommers at gmail.com (Ralf Gommers) Date: Sun, 10 Jun 2018 23:53:07 -0700 Subject: [Numpy-discussion] NEP: Random Number Generator Policy In-Reply-To: References: Message-ID: On Sun, Jun 10, 2018 at 11:15 PM, Robert Kern wrote: > On Sun, Jun 10, 2018 at 8:11 PM Ralf Gommers > wrote: > > > > On Sun, Jun 10, 2018 at 5:57 PM, Robert Kern > wrote: > >> > >> On Sun, Jun 10, 2018 at 5:47 PM Ralf Gommers > wrote: > >> > > >> > On Sun, Jun 3, 2018 at 9:23 PM, Warren Weckesser < > warren.weckesser at gmail.com> wrote: > >> > >> >> I suspect many of the tests will be easy to update, so fixing 300 or > so tests does not seem like a monumental task. > >> > > >> > It's all not monumental, but it adds up quickly. In addition to > changing tests, one will also need compatibility code when supporting > multiple numpy versions (e.g. scipy when get a copy of RandomStable in > scipy/_lib/_numpy_compat.py). > >> > > >> > A quick count of just np.random.seed occurrences with ``$ grep -roh > --include \*.py np.random.seed . | wc -w`` for some packages: > >> > numpy: 77 > >> > scipy: 462 > >> > matplotlib: 204 > >> > statsmodels: 461 > >> > pymc3: 36 > >> > scikit-image: 63 > >> > scikit-learn: 69 > >> > keras: 46 > >> > pytorch: 0 > >> > tensorflow: 368 > >> > astropy: 24 > >> > > >> > And note, these are *not* incorrect/broken usages, this is code that > works and has done so for years. > >> > >> Yes, some of them are incorrect and broken. Failure can be difficult to > detect. This module from keras is particularly problematic: > >> > >> https://github.com/keras-team/keras-preprocessing/blob/ > master/keras_preprocessing/image.py > > > > You have to appreciate that we're not all thinking at lightning speed > and in the same direction. 
If there is a difficult to detect problem, it > may be useful to give a brief code example (or even line of reasoning) of > how this actually breaks something. > > Ahem. Sorry. That wasn't the code I was thinking of. It's merely > hazardous, not broken by itself. However, if you used any of the `seed=` > arguments that are helpfully(?) provided, you are almost certainly writing > broken code. If you must use np.random.seed() to get reproducibility, you > need to call it exactly once at the start of your code (or maybe once for > each process) and let it ride. > > This is the impossible-to-use-correctly code that I was thinking of, which > got partially fixed after I pointed out the problem. > > https://github.com/keras-team/keras/pull/8325/files > > The intention of this code is to shuffle two same-length sequences in the > same way. So now if I write my code well to call np.random.seed() once at > the start of my program, this function comes along and obliterates that > with a fixed seed just so it can reuse the seed again to replicate the > shuffle. > Yes, that's a big no-no. There are situations conceivable where a library has to set a seed, but I think the right pattern in that case would be something like old_state = np.random.get_state() np.random.seed(some_int) do_stuff() np.random.set_state(**old._state) > Puzzlingly, the root sin of unconditionally and unavoidably reseeding for > some of these functions is still there even though I showed how and why to > avoid it. This is one reason why I was skeptical that merely documenting > RandomState or StableRandom to only be used for unit tests would work. :-) > Well, no matter what we do, I'm sure that there'll be lots of people who will still get it wrong:) > >> > Conclusion: the current proposal will cause work for the vast > majority of libraries that depends on numpy. The total amount of that work > will certainly not be counted in person-days/weeks, and more likely in > years than months. So I'm not convinced yet that the current proposal is > the best way forward. > >> > >> The mere usage of np.random.seed() doesn't imply that these packages > actually require stream-compatibility. Some might, for sure, like where > they are used in the unit tests, but that's not what you counted. At best, > these numbers just mean that we can't eliminate np.random.seed() in a new > system right away. > > > > Well, mere usage has been called an antipattern (also on your behalf), > plus for scipy over half of the usages do give test failures (Warren's > quick test). So I'd say that counting usages is a decent proxy for the work > that has to be done. > > Sure. But with my new proposal, we don't have to change it (as much as I'd > like to!). I'll draft up a PR to modify my NEP accordingly. > Sounds good! Cheers, Ralf -------------- next part -------------- An HTML attachment was scrubbed... URL: From kevin.k.sheppard at gmail.com Mon Jun 11 03:02:54 2018 From: kevin.k.sheppard at gmail.com (Kevin Sheppard) Date: Mon, 11 Jun 2018 08:02:54 +0100 Subject: [Numpy-discussion] NEP: Random Number Generator Policy (Robert Kern) In-Reply-To: References: Message-ID: <5b1e1e9e.1c69fb81.7976a.7aed@mx.google.com> Maybe a good place for a stable, testing focused generator would be in numpy.random.testing. This could host a default implementation of StableGenerator, although a better name might be TestingGenerator. 
It would also help users decide that this is not the generator they are looking for (I think many people might think StableGenerator is a good thing, after all, who wants an UnstableGenerator). -------------- next part -------------- An HTML attachment was scrubbed... URL: From robert.kern at gmail.com Mon Jun 11 03:29:33 2018 From: robert.kern at gmail.com (Robert Kern) Date: Mon, 11 Jun 2018 00:29:33 -0700 Subject: [Numpy-discussion] NEP: Random Number Generator Policy In-Reply-To: References: Message-ID: On Sun, Jun 10, 2018 at 11:44 PM Ralf Gommers wrote: > Note that scipy.stats distributions allow passing in either a RandomState instance or an integer as seed (which will be used for seeding a new instance, not for np.random.seed) [1]. That seems like a fine design pattern as well, and passing on a seed that way is fairly easy and as good for reproducibility as passing in a single PRNG. > > [1] https://github.com/scipy/scipy/blob/master/scipy/stats/_distn_infrastructure.py#L612 Well, carefully. You wouldn't want to pass on the same integer seed to multiple functions. Accepting an integer seed is super-convenient at the command line/notebooks, though, or docstrings or in tests or other situations where your "reproducibility horizon" is small. These utilities are good for scaling from these small use cases to up to large ones. scikit-learn is also a good example of good PRNG hygiene: https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/utils/validation.py#L715 >> > Also, if there is to be a multi-year transitioning to the new API, would there be two PRNG systems anyway during those years? >> >> Sure, but with a deadline and not-just-documentation to motivate transitioning. >> >> But if we follow my alternative proposal, there'll be no need for deprecation! You've convinced me to not deprecate RandomState. > > That's not how I had read it, but great to hear that! Indeed, I did deprecate the name RandomState in that drafting, but it's not really necessary, and you've convinced me that we shouldn't do it. -- Robert Kern -------------- next part -------------- An HTML attachment was scrubbed... URL: From robert.kern at gmail.com Mon Jun 11 03:33:08 2018 From: robert.kern at gmail.com (Robert Kern) Date: Mon, 11 Jun 2018 00:33:08 -0700 Subject: [Numpy-discussion] NEP: Random Number Generator Policy In-Reply-To: References: Message-ID: On Sun, Jun 10, 2018 at 11:54 PM Ralf Gommers wrote: > > On Sun, Jun 10, 2018 at 11:15 PM, Robert Kern wrote: >> Puzzlingly, the root sin of unconditionally and unavoidably reseeding for some of these functions is still there even though I showed how and why to avoid it. This is one reason why I was skeptical that merely documenting RandomState or StableRandom to only be used for unit tests would work. :-) > > Well, no matter what we do, I'm sure that there'll be lots of people who will still get it wrong:) Exactly! This is why I objected to leaving RandomState completely alone and just documenting it for use to generate test data. Inevitably, people will "get it wrong", so we need to design in anticipation of these failure modes and provide ways to work around them. >> Sure. But with my new proposal, we don't have to change it (as much as I'd like to!). I'll draft up a PR to modify my NEP accordingly. > > Sounds good! Thanks! Your and Josef's feedback on these points has been very helpful. -- Robert Kern -------------- next part -------------- An HTML attachment was scrubbed... 
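A minimal sketch of the seed-handling pattern referenced in the scikit-learn link above (the helper is paraphrased, not the actual scikit-learn implementation, and ``shuffle_in_unison`` is a hypothetical library function used only to illustrate passing the PRNG through instead of reseeding the global state):

    import numbers
    import numpy as np

    def check_random_state(seed):
        # Accept None, an int, or an existing RandomState and always hand
        # back a RandomState instance (paraphrase of the sklearn helper).
        if seed is None or seed is np.random:
            return np.random.mtrand._rand          # the global instance
        if isinstance(seed, (numbers.Integral, np.integer)):
            return np.random.RandomState(seed)
        if isinstance(seed, np.random.RandomState):
            return seed
        raise ValueError("%r cannot be used to seed a RandomState" % seed)

    def shuffle_in_unison(a, b, random_state=None):
        # Hypothetical library function: shuffle two same-length arrays the
        # same way without ever calling np.random.seed(), so the caller's
        # global state is never clobbered.
        rs = check_random_state(random_state)
        perm = rs.permutation(len(a))
        return a[perm], b[perm]

    prng = np.random.RandomState(12345)             # one stream for the whole program
    x, y = shuffle_in_unison(np.arange(10), np.arange(10) * 2, random_state=prng)

Passing the single ``prng`` instance (rather than the same integer seed) to every function that needs randomness keeps the whole program on one reproducible stream.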
URL: From njs at pobox.com Mon Jun 11 03:45:39 2018 From: njs at pobox.com (Nathaniel Smith) Date: Mon, 11 Jun 2018 00:45:39 -0700 Subject: [Numpy-discussion] NEP: Random Number Generator Policy In-Reply-To: References: Message-ID: On Sun, Jun 10, 2018 at 11:53 PM, Ralf Gommers wrote: > > On Sun, Jun 10, 2018 at 11:15 PM, Robert Kern wrote: >> >> The intention of this code is to shuffle two same-length sequences in the >> same way. So now if I write my code well to call np.random.seed() once at >> the start of my program, this function comes along and obliterates that with >> a fixed seed just so it can reuse the seed again to replicate the shuffle. > > > Yes, that's a big no-no. There are situations conceivable where a library > has to set a seed, but I think the right pattern in that case would be > something like > > old_state = np.random.get_state() > np.random.seed(some_int) > do_stuff() > np.random.set_state(**old._state) This will seem to work fine in testing, and then when someone tries to use your library in a multithreaded program everything will break in complicated and subtle ways :-(. I really don't think there's any conceivable situation where a library (as opposed to an application) can correctly use the global random state. -n -- Nathaniel J. Smith -- https://vorpus.org From josef.pktd at gmail.com Mon Jun 11 10:26:04 2018 From: josef.pktd at gmail.com (josef.pktd at gmail.com) Date: Mon, 11 Jun 2018 10:26:04 -0400 Subject: [Numpy-discussion] NEP: Random Number Generator Policy In-Reply-To: References: Message-ID: On Mon, Jun 11, 2018 at 2:43 AM, Ralf Gommers wrote: > > > On Sun, Jun 10, 2018 at 10:36 PM, Robert Kern > wrote: > >> On Sun, Jun 10, 2018 at 8:04 PM Ralf Gommers >> wrote: >> > >> > On Sun, Jun 10, 2018 at 6:08 PM, Robert Kern >> wrote: >> >> >> >> On Sun, Jun 10, 2018 at 5:27 PM Ralf Gommers >> wrote: >> >> > >> >> > On Mon, Jun 4, 2018 at 3:18 PM, Robert Kern >> wrote: >> >> >> >> >> >> On Sun, Jun 3, 2018 at 8:22 PM Ralf Gommers >> wrote: >> >> >>> >> >> >>> It may be worth having a look at test suites for scipy, >> statsmodels, scikit-learn, etc. and estimate how much work this NEP causes >> those projects. If the devs of those packages are forced to do large scale >> migrations from RandomState to StableState, then why not instead keep >> RandomState and just add a new API next to it? >> >> >> >> >> >> The problem is that we can't really have an ecosystem with two >> different general purpose systems. >> >> > >> >> > Can't = prefer not to. >> >> >> >> I meant what I wrote. :-) >> >> >> >> > But yes, that's true. That's not what I was saying though. We want >> one generic one, and one meant for unit testing only. You can achieve that >> in two ways: >> >> > 1. Change the current np.random API to new generic, and add a new >> RandomStable for unit tests. >> >> > 2. Add a new generic API, and document the current np.random API as >> being meant for unit tests only, for other usage should be >> preferred. >> >> > >> >> > (2) has a couple of pros: >> >> > - you're not forcing almost every library and end user out there to >> migrate their unit tests. >> >> >> >> But it has the cons that I talked about. RandomState *is* a fully >> functional general purpose PRNG system. After all, that's its current use. >> Documenting it as intended to be something else will not change that fact. >> Documentation alone provides no real impetus to move to the new system >> outside of the unit tests. 
And the community does need to move together to >> the new system in their library code, or else we won't be able to combine >> libraries together; these PRNG objects need to thread all the way through >> between code from different authors if we are to write programs with a >> controlled seed. The failure mode when people don't pay attention to the >> documentation is that I can no longer write programs that compose these >> libraries together. That's why I wrote "can't". It's not a mere preference >> for not having two systems to maintain. It has binary Go/No Go implications >> for building reproducible programs. >> > >> > I strongly suspect you are right, but only because you're asserting >> "can't" so heavily. I have trouble formulating what would go wrong in case >> there's two PRNGs used in a single program. It's not described in the NEP, >> nor in the numpy.random docs (those don't even have any recommendations for >> best practices listed as far as I can tell - that needs fixing). All you >> explain in the NEP is that reproducible research isn't helped by the >> current stream-compat guarantee. So a bit of (probably incorrect) devil's >> advocate reasoning: >> > - If there's no stream-compat guarantee, all a user can rely on is the >> properties of drawing from a seeded PRNG. >> > - Any use of a PRNG in library code can also only rely on properties >> > - So now whether in a user's program libraries draw from one or two >> seeded PRNGs doesn't matter for reproducibility, because those properties >> don't change. >> >> Correctly making a stochastic program reproducible while retaining good >> statistical properties is difficult. People don't do it well in the best of >> circumstances. The best way that we've found to manage that difficulty is >> to instantiate a single stream and use it all throughout your code. Every >> new stream requires the management of more seeds (unless if we use the >> fancy new algorithms that have settable stream IDs, but by stipulation, we >> don't have these in this case). And now I have to thread both of these >> objects through my code, and pass the right object to each third-party >> library. These third-party libraries don't know anything about this weird >> 2-stream workaround that you are doing, so we now have libraries that can't >> build on each other unless if they are using the same compatible API, even >> if I can make workarounds to build a program that combines two libraries >> side-to-side. >> >> So yeah, people "can" do this. "It's just a matter of code" as my boss >> likes to say. But it's making an already-difficult task more difficult. >> > > Okay, that makes more sense to me now. It would be really useful to > document such best practices and rationales. > > Note that scipy.stats distributions allow passing in either a RandomState > instance or an integer as seed (which will be used for seeding a new > instance, not for np.random.seed) [1]. That seems like a fine design > pattern as well, and passing on a seed that way is fairly easy and as good > for reproducibility as passing in a single PRNG. > > [1] https://github.com/scipy/scipy/blob/master/scipy/stats/ > _distn_infrastructure.py#L612 > > >> > Also, if there is to be a multi-year transitioning to the new API, >> would there be two PRNG systems anyway during those years? >> >> Sure, but with a deadline and not-just-documentation to motivate >> transitioning. >> >> But if we follow my alternative proposal, there'll be no need for >> deprecation! 
You've convinced me to not deprecate RandomState. >> > > That's not how I had read it, but great to hear that! > > I just want to change some of its internal implementation details, add a >> less-stable set of distributions on the side, and a framework of core >> uniform PRNGs that can be shared by both. >> >> >> > - more design freedom for the new generic API. The current one is >> clearly sub-optimal; in a new one you wouldn't have to expose all the >> global state/functions that np.random exposes now. You could even restrict >> it to a single class and put that in the main numpy namespace. >> >> >> >> I'm not sure why you are talking about the global state and >> np.random.* convenience functions. What we do with those functions is out >> of scope for this NEP and would be talked about it another NEP fully >> introducing the new system. >> > >> > To quote you from one of the first emails in this thread: " >> > I deliberately left it out of this one as it may, depending on our >> choices, impinge upon the design of the new PRNG subsystem, which I >> declared out of scope for this NEP. I have ideas (besides the glib "Let >> them eat AttributeErrors!"), and now that I think more about it, that does >> seem like it might be in scope just like the discussion of freezing >> RandomState and StableRandom are. But I think I'd like to hold that thought >> a little bit and get a little more screaming^Wfeedback on the core proposal >> first. I'll return to this in a few days if not sooner. >> > " >> > >> > So consider this some screaming^Wfeedback:) >> >> Ahem. Yes, I just remembered I said that. :-) But still, there will be >> lots of options about what to do with np.random.*, whatever proposal we go >> with. It doesn't really impose constraints on the core proposals. >> >> >> >> To properly use pseudorandom numbers, I need to instantiate a PRNG >> and thread it through all of the code in my program: both the parts that I >> write and the third party libraries that I don't write. >> >> >> >> >> >> Generating test data for unit tests is separable, though. That's >> why I propose having a StableRandom built on the new architecture. Its >> purpose would be well-documented, and in my proposal is limited in features >> such that it will be less likely to be abused outside of that purpose. If >> you make it fully-featured, it is more likely to be abused by building >> library code around it. But even if it is so abused, because it is built on >> the new architecture, at least I can thread the same core PRNG state >> through the StableRandom distributions from the abusing library and use the >> better distributions class elsewhere (randomgen names it "Generator"). Just >> keeping RandomState around can't work like that because it doesn't have a >> replaceable core PRNG. >> >> >> >> >> >> But that does suggest another alternative that we should explore: >> >> >> >> >> >> The new architecture separates the core uniform PRNG from the wide >> variety of non-uniform probability distributions. That is, the core PRNG >> state is encapsulated in a discrete object that can be shared between >> instances of different distribution-providing classes. numpy.random should >> provide two such distribution-providing classes. The main one (let us call >> it ``Generator``, as it is called in the prototype) will follow the new >> policy: distribution methods can break the stream in feature releases. 
>> There will also be a secondary distributions class (let us call it >> ``LegacyGenerator``) which contains distribution methods exactly as they >> exist in the current ``RandomState`` implementation. When one combines >> ``LegacyGenerator`` with the MT19937 core PRNG, it should reproduce the >> exact same stream as ``RandomState`` for all distribution methods. The >> ``LegacyGenerator`` methods will be forever frozen. >> ``numpy.random.RandomState()`` will instantiate a ``LegacyGenerator`` with >> the MT19937 core PRNG, and whatever tricks needed to make >> ``isinstance(prng, RandomState)`` and unpickling work should be done. This >> way of creating the ``LegacyGenerator`` by way of ``RandomState`` will be >> deprecated, becoming progressively noisier over a number of release cycles, >> in favor of explicitly instantiating ``LegacyGenerator``. >> >> >> >> >> >> ``LegacyGenerator`` CAN be used during this deprecation period in >> library and application code until libraries and applications can migrate >> to the new ``Generator``. Libraries and applications SHOULD migrate but >> MUST NOT be forced to. ``LegacyGenerator`` CAN be used to generate test >> data for unit tests where cross-release stability of the streams is >> important. Test writers SHOULD consider ways to mitigate their reliance on >> such stability and SHOULD limit their usage to distribution methods that >> have fewer cross-platform stability risks. >> >> >> >> I would appreciate your consideration of this proposal. Does it >> address your concerns? It addresses my concerns with keeping around a >> fully-functional RandomState implementation. >> > >> > My concerns are: >> > 1. The amount of work caused by making libraries and end users migrate. >> > 2. That this is a backwards compatibility break, which will cause >> problems for users who relied on the old guarantees (the arguments in the >> NEP that the old guarantees weren't 100% watertight don't mean that >> backcompat doesn't matter at all). >> > >> > As far as I can tell, this new proposal doesn't deal with those >> concerns directly. What it does seem to do is making transitioning a bit >> easier for users that were already using RandomState instances. >> >> Let me drop the deprecation of the name RandomState. >> RandomState(int_seed) will forever and always create a backwards- and >> stream-compatible object. No one will have to migrate. >> >> How does that strike you? >> > > Sounds good. > I'm trying to catch up here but I'm not sure what the latest version of the proposal is. IMO we need a stable stream of random numbers for the various distribution forever. Talking about deprecation misses the point that we don't want to have to migrate our unit tests to a "non-stable" stream of random numbers. In terms of user API we need some instance of a random state or random generator that can be used with scikit-learn's check_random_state (which was copied to scipy and will be copied to statsmodels when we get around to it.) IMO naming or pure API changes are fine, with deprecation of the old "style", as long as the changes can be done mechanically, e.g. adding "legacy" somewhere in the names or options. E.g. for scikit-learn and scipy.stats it might be just a small change in the check_random_state function, but maybe more changes in the unit tests that actually use and create a Random stream. Implementation I don't know or didn't pay enough attention to the details. The proposal sounds now like separating the distribution rvs generation from the underlying random stream. 
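A rough pseudocode sketch of that separation, using the class names from the quoted proposal and its prototype; none of these exist in a released NumPy, so this is illustration rather than working code:

    core = MT19937(12345)              # core uniform PRNG; owns the stream state
    new = Generator(core)              # distributions may improve between releases
    legacy = LegacyGenerator(core)     # distributions frozen to the RandomState streams

    new.standard_normal(5)             # latest/fastest algorithms
    legacy.standard_normal(5)          # bit-for-bit what RandomState produces

    # Both objects draw from the same underlying state, so code that needs the
    # frozen legacy distributions can still share one stream with code that has
    # moved to the new Generator.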
I had thought that was already in the proposal. If I were writing this for statsmodels, then I would hand a `method` keyword around that defaults to `method=None` which uses the latest and greatest available method independent of backwards compatibility, and method='stable' or method='legacy' as alternative. And maybe some distribution specific methods. This is separate from the option which underlying PRNG method to use. IIUC, the choices in the proposal are now 3 combinations - legacy: MT19937 core PRNG + distribution_method='stable' - mixed: MT19937 core PRNG + distribution_method=None - new: ??? core PRNG + distribution_method=None where the second might be just a special case of the third option, so it reduces to binary choice. aside to > The best way that we've found to manage that difficulty is to instantiate a single stream and use it all throughout your code. First, with check_random_state option it's up to a user As a user I have cases where I the use cases are independent and it doesn't matter if it uses the same seed. As a user I have cases where I would prefer if two methods use the same random stream. (e.g. bootstrap confidence intervals computed with two different methods where I wouldn't want the difference to come from different random streams when I compare them.) Also as a user, in some cases I used two different RandomState instances to get random numbers for different parts of the simulation. (Example: I generate y and x for a regression simulation separately, so that when I increase the number of observations, the initial sample stays the same, i.e. is recreated each time.) Some of this might make it into library code when statsmodels gets more simulation and bootstrap methods. Josef > > Cheers, > Ralf > > > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at python.org > https://mail.python.org/mailman/listinfo/numpy-discussion > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From m.h.vankerkwijk at gmail.com Mon Jun 11 10:59:49 2018 From: m.h.vankerkwijk at gmail.com (Marten van Kerkwijk) Date: Mon, 11 Jun 2018 10:59:49 -0400 Subject: [Numpy-discussion] Allowing broadcasting of code dimensions in generalized ufuncs In-Reply-To: References: Message-ID: > > Nathaniel: > > Output shape feels very similar to > output dtype to me, so maybe the general way to handle this would be > to make the first callback take the input shapes+dtypes and return the > desired output shapes+dtypes? > > This hits on an interesting alternative to frozen dimensions - np.cross > could just become a regular ufunc with signature np.dtype((float64, 3)), > np.dtype((float64, 3)) → np.dtype((float64, 3)) > As you note further down, the present proposal of just using numbers has the advantage of being clear and easy. Another (small?) advantage is that I can use `axis` to tell where my three coordinates are, rather than be stuck with having them as the last dimension. Indeed, in my trials for wrapping the Standards Of Fundamental Astronomy routines, I started with just making every 3-vector and 3x3-matrix structured arrays with the relevant single sub-array entry. That worked, but I ended up disliking the casting to and fro. > Furthermore, the expansion quickly becomes cumbersome. For instance, for > the all_equal signature of (n|1),(n|1)->() ? > > I think this is only a good argument when used in conjunction with the > broadcasting syntax. I don?t think it?s a reason for matmul not to have > multiple signatures. 
Having multiple signatures is an disincentive to > introduced too many overloads of the same function, which seems like a good > thing to me > But implementation for matmul is actually considerably trickier, since the internal loop now has to check the number of distinct dimensions. > Summarizing my overall opinions: > > - I?m +0.5 on frozen dimensions. The use-cases seem reasonable, and it > seems like an easy-ish way to get them. Allowing ufuncs to natively support > subarray types might be a tidier solution, but that could come down the road > > Indeed, they are not mutually exclusive. My guess would be that the use cases would be somewhat different. > > - I?m -1 on optional dimensions: they seem to legitimize creating many > overloads of gufuncs. I?m already not a fan of how matmul has special cases > for lower dimensions that don?t generalize well. To me, the best way to > handle matmul would be to use the proposed __array_function__ to > handle the shape-based special-case dispatching, either by: > - Inserting dimensions, and calling the true gufunc > np.linalg.matmul_2d (which is a function I?d like direct access to > anyway). > - Dispatching to one of four ufuncs > > I must admit I wish that `@` was just pure matrix multiplication... But otherwise agree with Stephan as optional dimensions being the least-bad solution. Aside: do agree we should think about how to expose the `linalg` gufuncs. > > - Broadcasting dimensions: > - I know you?re not suggesting this but: enabling broadcasting > unconditionally for all gufuncs would be a bad idea, masking linalg bugs. > (although einsum does support broadcasting?) > > Indeed, definitely *not* suggesting that! > > - > - Does it really need a per-dimension flag, rather than a global > one? Can you give a case where that?s useful? > > Mostly simply that the implementation is easier given the optional dimensions... Also, it has the benefit of being clear what the function can handle by inspection of the signature, i.e., it self-documents better (one of my main arguments in favour of frozen dimensions...). > > - > - If we?d already made all_equal a gufunc, I?d be +1 on adding > broadcasting support to it > - I?m -0.5 on the all_equal path in the first place. I think we > either should have a more generic approach to combined ufuncs, or just > declare them numbas job. > > I am working on and off on a way to generically chain ufuncs (goal would be to auto-create an inner loop that calls all the chained ufuncs loops in turn). Not sure that short-circuiting will be all that easy. I actually quite like the all_equal ufunc, but it is in part because I remember discovering how painfully slow (a==b).all() was (and still have a place where I would use it if it existed). And it does fit in the (admittedly vague) plans to try to make `.reduce` a gufunc. > > - > - Can you come up with a broadcasting use-case that isn?t just > chaining a reduction with a broadcasting ufunc? > > Perhaps the use is that it allows people to write gufuncs that are like such functions... Absent a mechanism to chain ufuncs, more complicated gufuncs are currently the easiest way to get fast more complicated algebra. But perhaps a putative weighted_mean(y, sigma) -> mean, sigma_mean is a decent example? Its signature would be (n),(n)->(),() but then you're forced to give individual sigmas for each point. With (n|1),(n|1)->(),() you are no longer forced to do that (though the case of all y being the same is less than useful here... 
I did at some point have an implementation that worked by core dimension of each argument, but ended up feeling it was not worth the extra complication) -- Marten -------------- next part -------------- An HTML attachment was scrubbed... URL: From josef.pktd at gmail.com Mon Jun 11 11:00:52 2018 From: josef.pktd at gmail.com (josef.pktd at gmail.com) Date: Mon, 11 Jun 2018 11:00:52 -0400 Subject: [Numpy-discussion] NEP: Random Number Generator Policy In-Reply-To: References: Message-ID: On Mon, Jun 11, 2018 at 10:26 AM, wrote: > > > On Mon, Jun 11, 2018 at 2:43 AM, Ralf Gommers > wrote: > >> >> >> On Sun, Jun 10, 2018 at 10:36 PM, Robert Kern >> wrote: >> >>> On Sun, Jun 10, 2018 at 8:04 PM Ralf Gommers >>> wrote: >>> > >>> > On Sun, Jun 10, 2018 at 6:08 PM, Robert Kern >>> wrote: >>> >> >>> >> On Sun, Jun 10, 2018 at 5:27 PM Ralf Gommers >>> wrote: >>> >> > >>> >> > On Mon, Jun 4, 2018 at 3:18 PM, Robert Kern >>> wrote: >>> >> >> >>> >> >> On Sun, Jun 3, 2018 at 8:22 PM Ralf Gommers < >>> ralf.gommers at gmail.com> wrote: >>> >> >>> >>> >> >>> It may be worth having a look at test suites for scipy, >>> statsmodels, scikit-learn, etc. and estimate how much work this NEP causes >>> those projects. If the devs of those packages are forced to do large scale >>> migrations from RandomState to StableState, then why not instead keep >>> RandomState and just add a new API next to it? >>> >> >> >>> >> >> The problem is that we can't really have an ecosystem with two >>> different general purpose systems. >>> >> > >>> >> > Can't = prefer not to. >>> >> >>> >> I meant what I wrote. :-) >>> >> >>> >> > But yes, that's true. That's not what I was saying though. We want >>> one generic one, and one meant for unit testing only. You can achieve that >>> in two ways: >>> >> > 1. Change the current np.random API to new generic, and add a new >>> RandomStable for unit tests. >>> >> > 2. Add a new generic API, and document the current np.random API as >>> being meant for unit tests only, for other usage should be >>> preferred. >>> >> > >>> >> > (2) has a couple of pros: >>> >> > - you're not forcing almost every library and end user out there to >>> migrate their unit tests. >>> >> >>> >> But it has the cons that I talked about. RandomState *is* a fully >>> functional general purpose PRNG system. After all, that's its current use. >>> Documenting it as intended to be something else will not change that fact. >>> Documentation alone provides no real impetus to move to the new system >>> outside of the unit tests. And the community does need to move together to >>> the new system in their library code, or else we won't be able to combine >>> libraries together; these PRNG objects need to thread all the way through >>> between code from different authors if we are to write programs with a >>> controlled seed. The failure mode when people don't pay attention to the >>> documentation is that I can no longer write programs that compose these >>> libraries together. That's why I wrote "can't". It's not a mere preference >>> for not having two systems to maintain. It has binary Go/No Go implications >>> for building reproducible programs. >>> > >>> > I strongly suspect you are right, but only because you're asserting >>> "can't" so heavily. I have trouble formulating what would go wrong in case >>> there's two PRNGs used in a single program. 
It's not described in the NEP, >>> nor in the numpy.random docs (those don't even have any recommendations for >>> best practices listed as far as I can tell - that needs fixing). All you >>> explain in the NEP is that reproducible research isn't helped by the >>> current stream-compat guarantee. So a bit of (probably incorrect) devil's >>> advocate reasoning: >>> > - If there's no stream-compat guarantee, all a user can rely on is the >>> properties of drawing from a seeded PRNG. >>> > - Any use of a PRNG in library code can also only rely on properties >>> > - So now whether in a user's program libraries draw from one or two >>> seeded PRNGs doesn't matter for reproducibility, because those properties >>> don't change. >>> >>> Correctly making a stochastic program reproducible while retaining good >>> statistical properties is difficult. People don't do it well in the best of >>> circumstances. The best way that we've found to manage that difficulty is >>> to instantiate a single stream and use it all throughout your code. Every >>> new stream requires the management of more seeds (unless if we use the >>> fancy new algorithms that have settable stream IDs, but by stipulation, we >>> don't have these in this case). And now I have to thread both of these >>> objects through my code, and pass the right object to each third-party >>> library. These third-party libraries don't know anything about this weird >>> 2-stream workaround that you are doing, so we now have libraries that can't >>> build on each other unless if they are using the same compatible API, even >>> if I can make workarounds to build a program that combines two libraries >>> side-to-side. >>> >>> So yeah, people "can" do this. "It's just a matter of code" as my boss >>> likes to say. But it's making an already-difficult task more difficult. >>> >> >> Okay, that makes more sense to me now. It would be really useful to >> document such best practices and rationales. >> >> Note that scipy.stats distributions allow passing in either a RandomState >> instance or an integer as seed (which will be used for seeding a new >> instance, not for np.random.seed) [1]. That seems like a fine design >> pattern as well, and passing on a seed that way is fairly easy and as good >> for reproducibility as passing in a single PRNG. >> >> [1] https://github.com/scipy/scipy/blob/master/scipy/stats/_ >> distn_infrastructure.py#L612 >> >> >>> > Also, if there is to be a multi-year transitioning to the new API, >>> would there be two PRNG systems anyway during those years? >>> >>> Sure, but with a deadline and not-just-documentation to motivate >>> transitioning. >>> >>> But if we follow my alternative proposal, there'll be no need for >>> deprecation! You've convinced me to not deprecate RandomState. >>> >> >> That's not how I had read it, but great to hear that! >> >> I just want to change some of its internal implementation details, add a >>> less-stable set of distributions on the side, and a framework of core >>> uniform PRNGs that can be shared by both. >>> >>> >> > - more design freedom for the new generic API. The current one is >>> clearly sub-optimal; in a new one you wouldn't have to expose all the >>> global state/functions that np.random exposes now. You could even restrict >>> it to a single class and put that in the main numpy namespace. >>> >> >>> >> I'm not sure why you are talking about the global state and >>> np.random.* convenience functions. 
What we do with those functions is out >>> of scope for this NEP and would be talked about it another NEP fully >>> introducing the new system. >>> > >>> > To quote you from one of the first emails in this thread: " >>> > I deliberately left it out of this one as it may, depending on our >>> choices, impinge upon the design of the new PRNG subsystem, which I >>> declared out of scope for this NEP. I have ideas (besides the glib "Let >>> them eat AttributeErrors!"), and now that I think more about it, that does >>> seem like it might be in scope just like the discussion of freezing >>> RandomState and StableRandom are. But I think I'd like to hold that thought >>> a little bit and get a little more screaming^Wfeedback on the core proposal >>> first. I'll return to this in a few days if not sooner. >>> > " >>> > >>> > So consider this some screaming^Wfeedback:) >>> >>> Ahem. Yes, I just remembered I said that. :-) But still, there will be >>> lots of options about what to do with np.random.*, whatever proposal we go >>> with. It doesn't really impose constraints on the core proposals. >>> >>> >> >> To properly use pseudorandom numbers, I need to instantiate a PRNG >>> and thread it through all of the code in my program: both the parts that I >>> write and the third party libraries that I don't write. >>> >> >> >>> >> >> Generating test data for unit tests is separable, though. That's >>> why I propose having a StableRandom built on the new architecture. Its >>> purpose would be well-documented, and in my proposal is limited in features >>> such that it will be less likely to be abused outside of that purpose. If >>> you make it fully-featured, it is more likely to be abused by building >>> library code around it. But even if it is so abused, because it is built on >>> the new architecture, at least I can thread the same core PRNG state >>> through the StableRandom distributions from the abusing library and use the >>> better distributions class elsewhere (randomgen names it "Generator"). Just >>> keeping RandomState around can't work like that because it doesn't have a >>> replaceable core PRNG. >>> >> >> >>> >> >> But that does suggest another alternative that we should explore: >>> >> >> >>> >> >> The new architecture separates the core uniform PRNG from the wide >>> variety of non-uniform probability distributions. That is, the core PRNG >>> state is encapsulated in a discrete object that can be shared between >>> instances of different distribution-providing classes. numpy.random should >>> provide two such distribution-providing classes. The main one (let us call >>> it ``Generator``, as it is called in the prototype) will follow the new >>> policy: distribution methods can break the stream in feature releases. >>> There will also be a secondary distributions class (let us call it >>> ``LegacyGenerator``) which contains distribution methods exactly as they >>> exist in the current ``RandomState`` implementation. When one combines >>> ``LegacyGenerator`` with the MT19937 core PRNG, it should reproduce the >>> exact same stream as ``RandomState`` for all distribution methods. The >>> ``LegacyGenerator`` methods will be forever frozen. >>> ``numpy.random.RandomState()`` will instantiate a ``LegacyGenerator`` with >>> the MT19937 core PRNG, and whatever tricks needed to make >>> ``isinstance(prng, RandomState)`` and unpickling work should be done. 
This >>> way of creating the ``LegacyGenerator`` by way of ``RandomState`` will be >>> deprecated, becoming progressively noisier over a number of release cycles, >>> in favor of explicitly instantiating ``LegacyGenerator``. >>> >> >> >>> >> >> ``LegacyGenerator`` CAN be used during this deprecation period in >>> library and application code until libraries and applications can migrate >>> to the new ``Generator``. Libraries and applications SHOULD migrate but >>> MUST NOT be forced to. ``LegacyGenerator`` CAN be used to generate test >>> data for unit tests where cross-release stability of the streams is >>> important. Test writers SHOULD consider ways to mitigate their reliance on >>> such stability and SHOULD limit their usage to distribution methods that >>> have fewer cross-platform stability risks. >>> >> >>> >> I would appreciate your consideration of this proposal. Does it >>> address your concerns? It addresses my concerns with keeping around a >>> fully-functional RandomState implementation. >>> > >>> > My concerns are: >>> > 1. The amount of work caused by making libraries and end users migrate. >>> > 2. That this is a backwards compatibility break, which will cause >>> problems for users who relied on the old guarantees (the arguments in the >>> NEP that the old guarantees weren't 100% watertight don't mean that >>> backcompat doesn't matter at all). >>> > >>> > As far as I can tell, this new proposal doesn't deal with those >>> concerns directly. What it does seem to do is making transitioning a bit >>> easier for users that were already using RandomState instances. >>> >>> Let me drop the deprecation of the name RandomState. >>> RandomState(int_seed) will forever and always create a backwards- and >>> stream-compatible object. No one will have to migrate. >>> >>> How does that strike you? >>> >> >> Sounds good. >> > > > I'm trying to catch up here but I'm not sure what the latest version of > the proposal is. > > IMO we need a stable stream of random numbers for the various distribution > forever. Talking about deprecation misses the point that we don't want to > have to migrate our unit tests to a "non-stable" stream of random numbers. > > In terms of user API we need some instance of a random state or random > generator that can be used with scikit-learn's check_random_state (which > was copied to scipy and will be copied to statsmodels when we get around to > it.) > > IMO naming or pure API changes are fine, with deprecation of the old > "style", as long as the changes can be done mechanically, e.g. adding > "legacy" somewhere in the names or options. > E.g. for scikit-learn and scipy.stats it might be just a small change in > the check_random_state function, but maybe more changes in the unit tests > that actually use and create a Random stream. > > Implementation > I don't know or didn't pay enough attention to the details. > > The proposal sounds now like separating the distribution rvs generation > from the underlying random stream. I had thought that was already in the > proposal. > > If I were writing this for statsmodels, then I would hand a `method` > keyword around that defaults to `method=None` which uses the latest and > greatest available method independent of backwards compatibility, and > method='stable' or method='legacy' as alternative. And maybe some > distribution specific methods. > This is separate from the option which underlying PRNG method to use. 
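A small sketch of what such a ``method`` keyword could look like (a hypothetical function, not an actual statsmodels or NumPy API; the inversion branch simply stands in for a frozen "legacy" algorithm):

    import numpy as np

    def standard_exponential(prng, size, method=None):
        # method='legacy': a hard-coded algorithm that never changes, so the
        # stream stays reproducible across releases.
        if method == 'legacy':
            return -np.log(1.0 - prng.random_sample(size))   # inverse-CDF draw
        # method=None: defer to whatever the generator currently implements,
        # which is allowed to improve over time.
        return prng.standard_exponential(size)

    prng = np.random.RandomState(0)
    stable = standard_exponential(prng, 1000, method='legacy')
    latest = standard_exponential(prng, 1000)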
> > IIUC, the choices in the proposal are now 3 combinations > > - legacy: MT19937 core PRNG + distribution_method='stable' > - mixed: MT19937 core PRNG + distribution_method=None > - new: ??? core PRNG + distribution_method=None > > where the second might be just a special case of the third option, so it > reduces to binary choice. > > aside to > > The best way that we've found to manage that difficulty is to > instantiate a single stream and use it all throughout your code. > > First, with check_random_state option it's up to a user > > As a user I have cases where I the use cases are independent and it > doesn't matter if it uses the same seed. > As a user I have cases where I would prefer if two methods use the same > random stream. (e.g. bootstrap confidence intervals computed with two > different methods where I wouldn't want the difference to come from > different random streams when I compare them.) > Also as a user, in some cases I used two different RandomState instances > to get random numbers for different parts of the simulation. (Example: I > generate y and x for a regression simulation separately, so that when I > increase the number of observations, the initial sample stays the same, > i.e. is recreated each time.) > > Some of this might make it into library code when statsmodels gets more > simulation and bootstrap methods. > > Test writers SHOULD consider ways to mitigate their reliance on such stability and SHOULD limit their usage to distribution methods that have fewer cross-platform stability risks. Is there somewhere a list on what might be unstable across platforms? In statsmodels we struggle quite a bit with cross-platform problems in the unit tests. But most of them are because many test tolerances are pretty tight, and then e.g. linalg noise might fluctuate too much across machines and LAPACK versions. Other cases are because behavior in not nice cases differs across machines and versions, those cases are sometimes added intentionally and sometimes by accident. But I don't think we had a problem because of random number generation. E.g. for integers we are mostly limited to small numbers, either like in Poisson because exp might overflow, or because test cases are usually small for speed reasons. Josef > > Josef > > >> >> Cheers, >> Ralf >> >> >> >> _______________________________________________ >> NumPy-Discussion mailing list >> NumPy-Discussion at python.org >> https://mail.python.org/mailman/listinfo/numpy-discussion >> >> > -------------- next part -------------- An HTML attachment was scrubbed... URL: From matti.picus at gmail.com Mon Jun 11 13:10:34 2018 From: matti.picus at gmail.com (Matti Picus) Date: Mon, 11 Jun 2018 10:10:34 -0700 Subject: [Numpy-discussion] 1.14.5 bugfix release Message-ID: <502735bc-e86c-7060-3386-6ffda5b04731@gmail.com> An HTML attachment was scrubbed... URL: From charlesr.harris at gmail.com Mon Jun 11 14:13:22 2018 From: charlesr.harris at gmail.com (Charles R Harris) Date: Mon, 11 Jun 2018 12:13:22 -0600 Subject: [Numpy-discussion] 1.14.5 bugfix release In-Reply-To: <502735bc-e86c-7060-3386-6ffda5b04731@gmail.com> References: <502735bc-e86c-7060-3386-6ffda5b04731@gmail.com> Message-ID: On Mon, Jun 11, 2018 at 11:10 AM, Matti Picus wrote: > If there is a desire to do a bug-fix release 1.14.5 I would like to try my > hand at releasing it, using doc/RELEASE_WALKTHROUGH.rst.txt. There were a > few issues around compiling 1.14.4 on alpine and NetBSD. 
> Since 1.15 will probably be released soon, do we continue to push these > kind of bug fixes releases to 1.14.x? > Matti > We only need to make the release to fix the regressions. I was going to do it today/tomorrow as I think we have now covered all paths through the ifs. Usually it takes about 2-4 weeks for bug reports to settle out, but a think we can be a bit sooner here and the next release will be 1.15. If you want to give it a shot, go ahead. We need more people with some experience in the process, not to mention new perspectives on the walkthrough. I expect most of your time will be spent getting set up. I think you will also need commit privileges on `MacPython/numpy-wheels`, ping Matthew Brett for those. If you run into problems, let me know. Chuck -------------- next part -------------- An HTML attachment was scrubbed... URL: From wieser.eric+numpy at gmail.com Tue Jun 12 02:35:56 2018 From: wieser.eric+numpy at gmail.com (Eric Wieser) Date: Mon, 11 Jun 2018 23:35:56 -0700 Subject: [Numpy-discussion] Allowing broadcasting of code dimensions in generalized ufuncs In-Reply-To: References: Message-ID: Frozen dimensions: I started with just making every 3-vector and 3x3-matrix structured arrays with the relevant single sub-array entry I was actually suggesting omitting the structured dtype (ie, field names) altogether, and just using the subarray dtypes (which exist alone, but not in arrays). Another (small?) advantage is that I can use `axis This is a fair argument against my proposal - at any rate, I think we?d need a better story for subarray dtypes before trying to add support to them for ufuncs ------------------------------ Broadcasting dimensions But perhaps a putative weighted_mean ? is a decent example That?s fairly convincing as a non-chained ufunc case. Can you add an example like that to the NEP? Also, it has the benefit of being clear what the function can handle by inspection of the signature Is broadcasting (n),(n)->(),() less clear that (n|1),(n|1)->(),()? Can you come up with an example where only some dimensions make sense to broadcast? ------------------------------ Eric ? On Mon, 11 Jun 2018 at 08:04 Marten van Kerkwijk wrote: > Nathaniel: >> >> Output shape feels very similar to >> output dtype to me, so maybe the general way to handle this would be >> to make the first callback take the input shapes+dtypes and return the >> desired output shapes+dtypes? >> >> This hits on an interesting alternative to frozen dimensions - np.cross >> could just become a regular ufunc with signature np.dtype((float64, 3)), >> np.dtype((float64, 3)) → np.dtype((float64, 3)) >> > As you note further down, the present proposal of just using numbers has > the advantage of being clear and easy. Another (small?) advantage is that I > can use `axis` to tell where my three coordinates are, rather than be stuck > with having them as the last dimension. > > Indeed, in my trials for wrapping the Standards Of Fundamental Astronomy > routines, I started with just making every 3-vector and 3x3-matrix > structured arrays with the relevant single sub-array entry. That worked, > but I ended up disliking the casting to and fro. > > >> Furthermore, the expansion quickly becomes cumbersome. For instance, for >> the all_equal signature of (n|1),(n|1)->() ? >> >> I think this is only a good argument when used in conjunction with the >> broadcasting syntax. I don?t think it?s a reason for matmul not to have >> multiple signatures. 
Having multiple signatures is an disincentive to >> introduced too many overloads of the same function, which seems like a good >> thing to me >> > But implementation for matmul is actually considerably trickier, since the > internal loop now has to check the number of distinct dimensions. > > >> Summarizing my overall opinions: >> >> - I?m +0.5 on frozen dimensions. The use-cases seem reasonable, and >> it seems like an easy-ish way to get them. Allowing ufuncs to natively >> support subarray types might be a tidier solution, but that could come down >> the road >> >> Indeed, they are not mutually exclusive. My guess would be that the use > cases would be somewhat different. > > >> >> - I?m -1 on optional dimensions: they seem to legitimize creating >> many overloads of gufuncs. I?m already not a fan of how matmul has special >> cases for lower dimensions that don?t generalize well. To me, the best way >> to handle matmul would be to use the proposed __array_function__ to >> handle the shape-based special-case dispatching, either by: >> - Inserting dimensions, and calling the true gufunc >> np.linalg.matmul_2d (which is a function I?d like direct access to >> anyway). >> - Dispatching to one of four ufuncs >> >> I must admit I wish that `@` was just pure matrix multiplication... But > otherwise agree with Stephan as optional dimensions being the least-bad > solution. > > Aside: do agree we should think about how to expose the `linalg` gufuncs. > >> >> - Broadcasting dimensions: >> - I know you?re not suggesting this but: enabling broadcasting >> unconditionally for all gufuncs would be a bad idea, masking linalg bugs. >> (although einsum does support broadcasting?) >> >> Indeed, definitely *not* suggesting that! > > >> >> - >> - Does it really need a per-dimension flag, rather than a global >> one? Can you give a case where that?s useful? >> >> Mostly simply that the implementation is easier given the optional > dimensions... Also, it has the benefit of being clear what the function can > handle by inspection of the signature, i.e., it self-documents better (one > of my main arguments in favour of frozen dimensions...). > > >> >> - >> - If we?d already made all_equal a gufunc, I?d be +1 on adding >> broadcasting support to it >> - I?m -0.5 on the all_equal path in the first place. I think we >> either should have a more generic approach to combined ufuncs, or just >> declare them numbas job. >> >> I am working on and off on a way to generically chain ufuncs (goal would > be to auto-create an inner loop that calls all the chained ufuncs loops in > turn). Not sure that short-circuiting will be all that easy. > > I actually quite like the all_equal ufunc, but it is in part because I > remember discovering how painfully slow (a==b).all() was (and still have a > place where I would use it if it existed). And it does fit in the > (admittedly vague) plans to try to make `.reduce` a gufunc. > >> >> - >> - Can you come up with a broadcasting use-case that isn?t just >> chaining a reduction with a broadcasting ufunc? >> >> Perhaps the use is that it allows people to write gufuncs that are like > such functions... Absent a mechanism to chain ufuncs, more complicated > gufuncs are currently the easiest way to get fast more complicated algebra. > > But perhaps a putative > > weighted_mean(y, sigma) -> mean, sigma_mean > > is a decent example? Its signature would be > > (n),(n)->(),() > > but then you're forced to give individual sigmas for each point. 
With > > (n|1),(n|1)->(),() > > you are no longer forced to do that (though the case of all y being the > same is less than useful here... I did at some point have an implementation > that worked by core dimension of each argument, but ended up feeling it was > not worth the extra complication) > > -- Marten > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at python.org > https://mail.python.org/mailman/listinfo/numpy-discussion > -------------- next part -------------- An HTML attachment was scrubbed... URL: From wieser.eric+numpy at gmail.com Tue Jun 12 02:59:36 2018 From: wieser.eric+numpy at gmail.com (Eric Wieser) Date: Mon, 11 Jun 2018 23:59:36 -0700 Subject: [Numpy-discussion] Allowing broadcasting of code dimensions in generalized ufuncs In-Reply-To: References: Message-ID: I don?t understand your alternative here. If we overload np.matmul using *array_function*, then it would not use *ether* of these options for writing the operation in terms of other gufuncs. It would simply look for an *array_function* attribute, and call that method instead. Let me explain that suggestion a little more clearly. 1. There?d be a linalg.matmul2d that performs the real matrix case, which would be easy to make as a ufunc right now. 2. __matmul__ and __rmatmul__ would just call np.matmul, as they currently do (for consistency between np.matmul and operator.matmul, needed in python pre- at -operator) 3. np.matmul would be implemented as: @do_array_function_overridesdef matmul(a, b): if a.ndim != 1 and b.ndim != 1: return matmul2d(a, b) elif a.ndim != 1: return matmul2d(a, b[:,None])[...,0] elif b.ndim != 1: return matmul2d(a[None,:], b) else: # this one probably deserves its own ufunf return matmul2d(a[None,:], b[:,None])[0,0] 4. Quantity can just override __array_ufunc__ as with any other ufunc 5. DataArray, knowing the above doesn?t work, would implement something like @matmul.register_array_function(DataArray)def __array_function__(a, b): if a.ndim != 1 and b.ndim != 1: return matmul2d(a, b) else: # either: # - add/remove dummy dimensions in a dataarray-specific way # - downcast to ndarray and do the dimension juggling there Advantages of this approach: - Neither the ufunc machinery, nor __array_ufunc__, nor the inner loop, need to know about optional dimensions. - We get a matmul2d ufunc, that all subclasses support out of the box if they support matmul Eric ? -------------- next part -------------- An HTML attachment was scrubbed... URL: From charlesr.harris at gmail.com Tue Jun 12 17:26:25 2018 From: charlesr.harris at gmail.com (Charles R Harris) Date: Tue, 12 Jun 2018 15:26:25 -0600 Subject: [Numpy-discussion] SciPy 2018 Message-ID: Hi All, Thought I'd raise the topic of meeting up at SciPy 2018. I wasn't planning on registering for the main conference, but would be happy to fly down for a couple of days if we plan on a meetup during sprints or some other point in the conference schedule. Chuck -------------- next part -------------- An HTML attachment was scrubbed... URL: From matti.picus at gmail.com Tue Jun 12 17:40:09 2018 From: matti.picus at gmail.com (Matti Picus) Date: Tue, 12 Jun 2018 14:40:09 -0700 Subject: [Numpy-discussion] SciPy 2018 In-Reply-To: References: Message-ID: On 12/06/18 14:26, Charles R Harris wrote: > Hi All, > > Thought I'd raise the topic of meeting up at SciPy 2018. 
I wasn't > planning on registering for the main conference, but would be happy to > fly down for a couple of days if we plan on a meetup during sprints or > some other point in the conference schedule. > > Chuck > There will be a NumPy sprint July 14-15. I have requested a BOF room. For the BOF, I hoped to continue the discussion of the NumPy roadmap https://github.com/numpy/numpy/wiki/NumPy-Roadmap as well as provide a forum to meet in person. Matti From matti.picus at gmail.com Tue Jun 12 18:22:07 2018 From: matti.picus at gmail.com (Matti Picus) Date: Tue, 12 Jun 2018 15:22:07 -0700 Subject: [Numpy-discussion] Permissions to upload to PyPI Message-ID: Almost ready to finish the 1.14.5 release, but it seems I need permissions to upload to PyPI (makes sense). My user name there is mattip. Can someone help out? Matti From charlesr.harris at gmail.com Tue Jun 12 18:26:18 2018 From: charlesr.harris at gmail.com (Charles R Harris) Date: Tue, 12 Jun 2018 16:26:18 -0600 Subject: [Numpy-discussion] Permissions to upload to PyPI In-Reply-To: References: Message-ID: On Tue, Jun 12, 2018 at 4:22 PM, Matti Picus wrote: > Almost ready to finish the 1.14.5 release, but it seems I need permissions > to upload to PyPI (makes sense). My user name there is mattip. Can someone > help out? > Matti > Done. Sorry I missed that. Chuck -------------- next part -------------- An HTML attachment was scrubbed... URL: From charlesr.harris at gmail.com Tue Jun 12 18:38:02 2018 From: charlesr.harris at gmail.com (Charles R Harris) Date: Tue, 12 Jun 2018 16:38:02 -0600 Subject: [Numpy-discussion] NumPy 1.15.x branched. Message-ID: Hi All, NumPy 1.15.x has been branched and master is now open for 1.16 development. If there are any remaining PRs that *just have to be in 1.15*, please complain here :0 Chuck -------------- next part -------------- An HTML attachment was scrubbed... URL: From matti.picus at gmail.com Tue Jun 12 20:09:28 2018 From: matti.picus at gmail.com (Matti Picus) Date: Tue, 12 Jun 2018 17:09:28 -0700 Subject: [Numpy-discussion] NumPy 1.14.5 released Message-ID: <7466261a-7d28-da89-d7a4-c2494ebae3ce@gmail.com> Hi All, I am pleased to announce the release of NumPy 14.4.5. This is a bugfix release for bugs reported following the 1.14.4 release. The most significant fixes are: * fixes for compilation errors on alpine and NetBSD The Python versions supported in this release are 2.7 and 3.4 - 3.6. The Python 3.6 wheels available from PIP are built with Python 3.6.2 and should be compatible with all previous versions of Python 3.6. The source releases were cythonized with Cython 0.28.2 and should work for the upcoming Python 3.7. Contributors ============ A total of 1 person contributed to this release.? People with a "+" by their names contributed a patch for the first time. * Charles Harris Pull requests merged ==================== A total of 2 pull requests were merged for this release. * `#11274 `__: BUG: Correct use of NPY_UNUSED. * `#11294 `__: BUG: Remove extra trailing parentheses. 
Cheers, Matti From m.h.vankerkwijk at gmail.com Tue Jun 12 21:13:47 2018 From: m.h.vankerkwijk at gmail.com (Marten van Kerkwijk) Date: Tue, 12 Jun 2018 21:13:47 -0400 Subject: [Numpy-discussion] Allowing broadcasting of code dimensions in generalized ufuncs In-Reply-To: References: Message-ID: On Tue, Jun 12, 2018 at 2:35 AM, Eric Wieser wrote: > Frozen dimensions: > > I started with just making every 3-vector and 3x3-matrix structured arrays > with the relevant single sub-array entry > > I was actually suggesting omitting the structured dtype (ie, field names) > altogether, and just using the subarray dtypes (which exist alone, but not > in arrays). > > Another (small?) advantage is that I can use `axis > > This is a fair argument against my proposal - at any rate, I think we?d > need a better story for subarray dtypes before trying to add support to > them for ufuncs > Yes, I've been wondering about the point of the sub-arrays... They seem interesting but in arrays just disappear. Their possible use would be to change the shape as seen by the outside world (as happens if one does define that sub-array as a 1-part structured array). Anyway, for another discussion! > ------------------------------ > > Broadcasting dimensions > > But perhaps a putative weighted_mean ? is a decent example > > That?s fairly convincing as a non-chained ufunc case. Can you add an > example like that to the NEP? > Done. > Also, it has the benefit of being clear what the function can handle by > inspection of the signature > > Is broadcasting (n),(n)->(),() less clear that (n|1),(n|1)->(),()? Can > you come up with an example where only some dimensions make sense to > broadcast? > Not a super-convincing one, though I guess one could think of a similar function for 3-vectors (which somehow must care about those being three-dimensional, because, say, it calculates the average direction of the cross product in spherical angles...), then, in the signature `(n,3),(n,3)->(),(),(),()` one would like to indicate that the `n` could be broadcast, but the `3` could not. As I now write in the NEP, part of the reason of doing it by distinct dimension is that I already need a flag for flexible, so it is easy to add one for broadcastable; similarly, in the actual code, there is quite a bit of shared stuff. -- Marten -------------- next part -------------- An HTML attachment was scrubbed... URL: From charlesr.harris at gmail.com Wed Jun 13 17:27:50 2018 From: charlesr.harris at gmail.com (Charles R Harris) Date: Wed, 13 Jun 2018 15:27:50 -0600 Subject: [Numpy-discussion] Dropping Python 3.4 support for NumPy 1.16 Message-ID: Hi All, I think NumPy 1.16 would be a good time to drop Python 3.4 support. We will want to do that anyway once we drop 2.7 so that we will only be using recent Windows compilers, and with Python 3.7 due at the end of the month I think supporting 3.5-7 for 1.16 should be sufficient. Thoughts? Chuck -------------- next part -------------- An HTML attachment was scrubbed... URL: From shoyer at gmail.com Wed Jun 13 17:45:23 2018 From: shoyer at gmail.com (Stephan Hoyer) Date: Wed, 13 Jun 2018 14:45:23 -0700 Subject: [Numpy-discussion] Dropping Python 3.4 support for NumPy 1.16 In-Reply-To: References: Message-ID: This sounds good to me. Most of the downstream projects I work with have already dropped Python 3.4 support. On Wed, Jun 13, 2018 at 2:30 PM Charles R Harris wrote: > Hi All, > > I think NumPy 1.16 would be a good time to drop Python 3.4 support. 
We > will want to do that anyway once we drop 2.7 so that we will only be using > recent Windows compilers, and with Python 3.7 due at the end of the month I > think supporting 3.5-7 for 1.16 should be sufficient. > > Thoughts? > > Chuck > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at python.org > https://mail.python.org/mailman/listinfo/numpy-discussion > -------------- next part -------------- An HTML attachment was scrubbed... URL: From chris.barker at noaa.gov Wed Jun 13 17:56:06 2018 From: chris.barker at noaa.gov (Chris Barker) Date: Wed, 13 Jun 2018 14:56:06 -0700 Subject: [Numpy-discussion] Dropping Python 3.4 support for NumPy 1.16 In-Reply-To: References: Message-ID: > > I think NumPy 1.16 would be a good time to drop Python 3.4 support. >> > +1 Using python3 before 3.5 was still kinda "bleeding edge" -- so projects are more likely to be actively upgrading. -CHB -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker at noaa.gov -------------- next part -------------- An HTML attachment was scrubbed... URL: From millman at berkeley.edu Wed Jun 13 18:15:20 2018 From: millman at berkeley.edu (Jarrod Millman) Date: Wed, 13 Jun 2018 15:15:20 -0700 Subject: [Numpy-discussion] Dropping Python 3.4 support for NumPy 1.16 In-Reply-To: References: Message-ID: +1 On Wed, Jun 13, 2018 at 2:27 PM, Charles R Harris wrote: > Hi All, > > I think NumPy 1.16 would be a good time to drop Python 3.4 support. We will > want to do that anyway once we drop 2.7 so that we will only be using recent > Windows compilers, and with Python 3.7 due at the end of the month I think > supporting 3.5-7 for 1.16 should be sufficient. > > Thoughts? > > Chuck > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at python.org > https://mail.python.org/mailman/listinfo/numpy-discussion > From charlesr.harris at gmail.com Wed Jun 13 20:10:50 2018 From: charlesr.harris at gmail.com (Charles R Harris) Date: Wed, 13 Jun 2018 18:10:50 -0600 Subject: [Numpy-discussion] Updated 1.15.0 release notes Message-ID: Hi All, There is a PR for the updated NumPy 1.15.0 release notes . I would appreciate it if all those involved in that release would have a look and fix incorrect or missing notes. Cheers, Chuck -------------- next part -------------- An HTML attachment was scrubbed... URL: From nathan12343 at gmail.com Wed Jun 13 20:28:12 2018 From: nathan12343 at gmail.com (Nathan Goldbaum) Date: Wed, 13 Jun 2018 19:28:12 -0500 Subject: [Numpy-discussion] Updated 1.15.0 release notes In-Reply-To: References: Message-ID: Hi Chuck, Are you planning on doing an rc release this time? I think the NumPy 1.14 release was unusually bumpy and part of that was the lack of an rc. One example: importing h5py caused a warning under numpy 1.14 and an h5py release didn't come out with a workaround or fix for a couple months. There was also an issue with array printing that caused problems in yt (although both yt and NumPy quickly did bugfix releases that fixed that).
I guess 1.14 was particularly noisy, but still I?d really appreciate having a prerelease version to test against and some time to report issues with the prerelease so numpy and other projects can implement workarounds as needed without doing a release that might potentially break real users who happen to install right after numpy 1.x.0 comes out. Best, Nathan Goldbaum On Wed, Jun 13, 2018 at 7:11 PM Charles R Harris wrote: > Hi All, > > There is a PR for the updated NumPy 1.15.0 release notes > . I would appreciate it if > all those involved in the thatn release would have a look and fix incorrect > or missing notes. > > Cheers, > > Chuck > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at python.org > https://mail.python.org/mailman/listinfo/numpy-discussion > -------------- next part -------------- An HTML attachment was scrubbed... URL: From m.h.vankerkwijk at gmail.com Wed Jun 13 20:33:39 2018 From: m.h.vankerkwijk at gmail.com (Marten van Kerkwijk) Date: Wed, 13 Jun 2018 20:33:39 -0400 Subject: [Numpy-discussion] Updated 1.15.0 release notes In-Reply-To: References: Message-ID: Request for a -rc seconded (although this time we should be fine for astropy, as things are working well with -dev). -- Marten -------------- next part -------------- An HTML attachment was scrubbed... URL: From charlesr.harris at gmail.com Wed Jun 13 20:42:10 2018 From: charlesr.harris at gmail.com (Charles R Harris) Date: Wed, 13 Jun 2018 18:42:10 -0600 Subject: [Numpy-discussion] Updated 1.15.0 release notes In-Reply-To: References: Message-ID: On Wed, Jun 13, 2018 at 6:28 PM, Nathan Goldbaum wrote: > Hi Chuck, > > Are you planning on doing an rc release this time? I think the NumPy 1.14 > release was unusually bumpy and part of that was the lack of an rc. One > example: importing h5py caused a warning under numpy 1.14 and an h5py > release didn?t come out with a workaround or fix for a couple months. There > was also an issue with array printing that caused problems in yt (although > both yt and NumPy quickly did bugfix releases that fixed that). > > I guess 1.14 was particularly noisy, but still I?d really appreciate > having a prerelease version to test against and some time to report issues > with the prerelease so numpy and other projects can implement workarounds > as needed without doing a release that might potentially break real users > who happen to install right after numpy 1.x.0 comes out. > There was a 1.14.0rc1 . I was too quick for the full release, just waited three weeks, so maybe four this time. Too few people actually test the candidates and give feedback, so I tend to regard the *.*.0 releases as the true rc :) Chuck -------------- next part -------------- An HTML attachment was scrubbed... URL: From nathan12343 at gmail.com Wed Jun 13 21:16:45 2018 From: nathan12343 at gmail.com (Nathan Goldbaum) Date: Wed, 13 Jun 2018 20:16:45 -0500 Subject: [Numpy-discussion] Updated 1.15.0 release notes In-Reply-To: References: Message-ID: OK I guess I missed that announcement. I wouldn?t mind more than one email with a reminder to test. On Wed, Jun 13, 2018 at 7:42 PM Charles R Harris wrote: > On Wed, Jun 13, 2018 at 6:28 PM, Nathan Goldbaum > wrote: > >> Hi Chuck, >> >> Are you planning on doing an rc release this time? I think the NumPy 1.14 >> release was unusually bumpy and part of that was the lack of an rc. 
One >> example: importing h5py caused a warning under numpy 1.14 and an h5py >> release didn?t come out with a workaround or fix for a couple months. There >> was also an issue with array printing that caused problems in yt (although >> both yt and NumPy quickly did bugfix releases that fixed that). >> >> I guess 1.14 was particularly noisy, but still I?d really appreciate >> having a prerelease version to test against and some time to report issues >> with the prerelease so numpy and other projects can implement workarounds >> as needed without doing a release that might potentially break real users >> who happen to install right after numpy 1.x.0 comes out. >> > > There was a 1.14.0rc1 > . I was too quick > for the full release, just waited three weeks, so maybe four this time. Too > few people actually test the candidates and give feedback, so I tend to > regard the *.*.0 releases as the true rc :) > > Chuck > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at python.org > https://mail.python.org/mailman/listinfo/numpy-discussion > -------------- next part -------------- An HTML attachment was scrubbed... URL: From matthew.brett at gmail.com Thu Jun 14 04:48:14 2018 From: matthew.brett at gmail.com (Matthew Brett) Date: Thu, 14 Jun 2018 09:48:14 +0100 Subject: [Numpy-discussion] Updated 1.15.0 release notes In-Reply-To: References: Message-ID: Hi Nathan, One very helpful think you could do, is add a Travis-CI matrix entry where you are testing against the latest numpy nightly builds. I got a bit lost in your tox setup, but the basic idea is that, for one test entry, you add the following flags to pip: -f https://7933911d6844c6c53a7d-47bd50c35cd79bd838daf386af554a83.ssl.cf2.rackcdn.com --pre In that case, you'll pull in the latest nightly build of Numpy. See the Scipy .travis.yml setup for an example. Cheers, Matthew On Thu, Jun 14, 2018 at 2:16 AM, Nathan Goldbaum wrote: > OK I guess I missed that announcement. > > I wouldn?t mind more than one email with a reminder to test. > > On Wed, Jun 13, 2018 at 7:42 PM Charles R Harris > wrote: >> >> On Wed, Jun 13, 2018 at 6:28 PM, Nathan Goldbaum >> wrote: >>> >>> Hi Chuck, >>> >>> Are you planning on doing an rc release this time? I think the NumPy 1.14 >>> release was unusually bumpy and part of that was the lack of an rc. One >>> example: importing h5py caused a warning under numpy 1.14 and an h5py >>> release didn?t come out with a workaround or fix for a couple months. There >>> was also an issue with array printing that caused problems in yt (although >>> both yt and NumPy quickly did bugfix releases that fixed that). >>> >>> I guess 1.14 was particularly noisy, but still I?d really appreciate >>> having a prerelease version to test against and some time to report issues >>> with the prerelease so numpy and other projects can implement workarounds as >>> needed without doing a release that might potentially break real users who >>> happen to install right after numpy 1.x.0 comes out. >> >> >> There was a 1.14.0rc1. I was too quick for the full release, just waited >> three weeks, so maybe four this time. 
Too few people actually test the >> candidates and give feedback, so I tend to regard the *.*.0 releases as the >> true rc :) >> >> Chuck >> _______________________________________________ >> NumPy-Discussion mailing list >> NumPy-Discussion at python.org >> https://mail.python.org/mailman/listinfo/numpy-discussion > > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at python.org > https://mail.python.org/mailman/listinfo/numpy-discussion > From m.h.vankerkwijk at gmail.com Thu Jun 14 10:44:57 2018 From: m.h.vankerkwijk at gmail.com (Marten van Kerkwijk) Date: Thu, 14 Jun 2018 10:44:57 -0400 Subject: [Numpy-discussion] Updated 1.15.0 release notes In-Reply-To: References: Message-ID: Indeed, we do something similar in astropy, with a pre-release failure being considered breakage (rather than ignorable as for -dev): https://github.com/astropy/astropy/blob/master/.travis.yml#L142 ?-- Marten -------------- next part -------------- An HTML attachment was scrubbed... URL: From m.h.vankerkwijk at gmail.com Thu Jun 14 13:50:29 2018 From: m.h.vankerkwijk at gmail.com (Marten van Kerkwijk) Date: Thu, 14 Jun 2018 13:50:29 -0400 Subject: [Numpy-discussion] Dropping Python 3.4 support for NumPy 1.16 In-Reply-To: References: Message-ID: It seems everyone is in favour - anybody in for making a PR reducing the travis testing accordingly? (It seems a bit of overkill more generally - would be good to reduce the kWhr footprint a little...) -- Marten -------------- next part -------------- An HTML attachment was scrubbed... URL: From teoliphant at gmail.com Thu Jun 14 14:09:54 2018 From: teoliphant at gmail.com (Travis Oliphant) Date: Thu, 14 Jun 2018 13:09:54 -0500 Subject: [Numpy-discussion] Dropping Python 3.4 support for NumPy 1.16 In-Reply-To: References: Message-ID: It is a welcome thing to see Python 2.7 support disappearing. Dropping 3.4 support in new releases sounds like a great idea as well. NumPy was originally pitched as a Python 3 thing... Travis On Thu, Jun 14, 2018, 12:52 PM Marten van Kerkwijk < m.h.vankerkwijk at gmail.com> wrote: > It seems everyone is in favour - anybody in for making a PR reducing the > travis testing accordingly? (It seems a bit of overkill more generally - > would be good to reduce the kWhr footprint a little...) -- Marten > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at python.org > https://mail.python.org/mailman/listinfo/numpy-discussion > -------------- next part -------------- An HTML attachment was scrubbed... URL: From matti.picus at gmail.com Thu Jun 14 14:13:54 2018 From: matti.picus at gmail.com (Matti Picus) Date: Thu, 14 Jun 2018 11:13:54 -0700 Subject: [Numpy-discussion] Circle CI moving from 1.0 to 2.0 Message-ID: I stumbled across this notice (only seems to appear in a failed build) "This project is currently running on CircleCI 1.0 which will no longer be supported after August 31, 2018. Please start migrating this project to CircleCI 2.0 ." Here is the original link https://circleci.com/gh/numpy/numpy/2080 Is this an artifact that can be ignored or do we need to migrate, if so has anyone already done it for their project? Matti From einstein.edison at gmail.com Thu Jun 14 14:35:44 2018 From: einstein.edison at gmail.com (Hameer Abbasi) Date: Thu, 14 Jun 2018 14:35:44 -0400 Subject: [Numpy-discussion] Circle CI moving from 1.0 to 2.0 In-Reply-To: References: Message-ID: Hi Matti, It seems the CircleCI config is already on Version 2.0. 
See here, notice the 2.0 in front of every successful build. https://circleci.com/gh/numpy/numpy I can also see that some failed builds have 1.0 in front of them... But this shouldn't happen. Most likely this is a CircleCI issue, not one with our configuration. It can be safely ignored. Regards, Hameer Abbasi On 14/06/2018 at 23:13, Matti wrote: I stumbled across this notice (only seems to appear in a failed build) "This project is currently running on CircleCI 1.0 which will no longer be supported after August 31, 2018. Please start migrating this project to CircleCI 2.0 ." Here is the original link https://circleci.com/gh/numpy/numpy/2080 Is this an artifact that can be ignored or do we need to migrate, if so has anyone already done it for their project? Matti _______________________________________________ NumPy-Discussion mailing list NumPy-Discussion at python.org https://mail.python.org/mailman/listinfo/numpy-discussion -------------- next part -------------- An HTML attachment was scrubbed... URL: From einstein.edison at gmail.com Thu Jun 14 15:09:10 2018 From: einstein.edison at gmail.com (Hameer Abbasi) Date: Thu, 14 Jun 2018 12:09:10 -0700 Subject: [Numpy-discussion] Dropping Python 3.4 support for NumPy 1.16 In-Reply-To: References: Message-ID: It was a small task. I created a PR for it here . Feel free to merge after CI passes or close. Hameer Abbasi Sent from Astro for Mac On 14. Jun 2018 at 22:50, Marten van Kerkwijk wrote: It seems everyone is in favour - anybody in for making a PR reducing the travis testing accordingly? (It seems a bit of overkill more generally - would be good to reduce the kWhr footprint a little...) -- Marten _______________________________________________ NumPy-Discussion mailing list NumPy-Discussion at python.org https://mail.python.org/mailman/listinfo/numpy-discussion -------------- next part -------------- An HTML attachment was scrubbed... URL: From m.h.vankerkwijk at gmail.com Fri Jun 15 10:07:22 2018 From: m.h.vankerkwijk at gmail.com (Marten van Kerkwijk) Date: Fri, 15 Jun 2018 10:07:22 -0400 Subject: [Numpy-discussion] Allowing broadcasting of code dimensions in generalized ufuncs In-Reply-To: References: Message-ID: Hi All, The discussion on the gufunc signature enhancements seems to have stalled a bit, but while it was going I've tried to update the NEP correspondingly. The NEP is now merged, so can viewed more easily, at http://www.numpy.org/neps/nep-0020-gufunc-signature-enhancement.html My own quite possibly biased summary of the discussion so far is that: 1) Frozen dimensions are generally seen as a good idea; other implementations may be possible, but are not as clear. 2) Flexible dimensions have little use beyond matmul; the main discussion is whether there is a better way. In my opinion, the main benefit of the current proposal is that it allows operator overrides to all work the same way (via __array_ufunc__), independent of any assumptions about the object that does the override (such as that it has a shape). 3) Broadcastable dimensions had less support, but mostly for lack of examples; there now is one beyond all_equal, for which a gufunc is more clearly the proper route: a weighted average (which has obvious extensions). A general benefit of course is that there is actual code for all three; it would certainly be nice if we could fully support `matmul` and `@` in 1.16. 
So, the question would seem whether the NEP should be accepted or rejected (with part acceptance of course being possible, though I note that flexible and broadcastable share a lot of implementation, so in my opinion it is somewhat pointless to do just one of them). All the best, Marten -------------- next part -------------- An HTML attachment was scrubbed... URL: From shoyer at gmail.com Fri Jun 15 14:17:09 2018 From: shoyer at gmail.com (Stephan Hoyer) Date: Fri, 15 Jun 2018 11:17:09 -0700 Subject: [Numpy-discussion] Allowing broadcasting of code dimensions in generalized ufuncs In-Reply-To: References: Message-ID: On Mon, Jun 11, 2018 at 11:59 PM Eric Wieser wrote: > I don?t understand your alternative here. If we overload np.matmul using > *array_function*, then it would not use *ether* of these options for > writing the operation in terms of other gufuncs. It would simply look for > an *array_function* attribute, and call that method instead. > > Let me explain that suggestion a little more clearly. > > 1. There?d be a linalg.matmul2d that performs the real matrix case, > which would be easy to make as a ufunc right now. > 2. __matmul__ and __rmatmul__ would just call np.matmul, as they > currently do (for consistency between np.matmul and operator.matmul, > needed in python pre- at -operator) > 3. np.matmul would be implemented as: > > @do_array_function_overridesdef matmul(a, b): > if a.ndim != 1 and b.ndim != 1: > return matmul2d(a, b) > elif a.ndim != 1: > return matmul2d(a, b[:,None])[...,0] > elif b.ndim != 1: > return matmul2d(a[None,:], b) > else: > # this one probably deserves its own ufunf > return matmul2d(a[None,:], b[:,None])[0,0] > > 4. Quantity can just override __array_ufunc__ as with any other ufunc > 5. DataArray, knowing the above doesn?t work, would implement > something like > > @matmul.register_array_function(DataArray)def __array_function__(a, b): > if a.ndim != 1 and b.ndim != 1: > return matmul2d(a, b) > else: > # either: > # - add/remove dummy dimensions in a dataarray-specific way > # - downcast to ndarray and do the dimension juggling there > > > Advantages of this approach: > > - > > Neither the ufunc machinery, nor __array_ufunc__, nor the inner loop, > need to know about optional dimensions. > - > > We get a matmul2d ufunc, that all subclasses support out of the box if > they support matmul > > Eric > OK, this sounds pretty reasonable to me -- assuming we manage to figure out the __array_function__ proposal! There's one additional ingredient we would need to make this work well: some way to guarantee that "ndim" and indexing operations are available without casting to a base numpy array. For now, np.asanyarray() would probably suffice, but that isn't quite right (e.g., this would fail for np.matrix). In the long term, I think we need a new coercion protocol for "duck" arrays. Nathaniel Smith and I started writing a NEP on this, but it isn't quite ready yet. > ? > -------------- next part -------------- An HTML attachment was scrubbed... URL: From robert.kern at gmail.com Sat Jun 16 03:38:57 2018 From: robert.kern at gmail.com (Robert Kern) Date: Sat, 16 Jun 2018 00:38:57 -0700 Subject: [Numpy-discussion] NEP: Random Number Generator Policy In-Reply-To: References: Message-ID: I have incorporated the feedback from this thread, and have significantly altered the proposal. I think this version will be more palatable to everyone. 
https://github.com/numpy/numpy/pull/11356 https://github.com/rkern/numpy/blob/nep/rng-clarification/doc/neps/nep-0019-rng-policy.rst I'm pretty sure that Kevin Sheppard's prototype already implements the broad strokes of my proposal (seriously, he thinks of everything; I'm just playing catch up), so I don't think there is any technical risk. I think it's just a matter of the fine details of shoving this into numpy.random per se rather than a third party package. https://bashtage.github.io/randomgen/devel/legacy.html --- ============================== Random Number Generator Policy ============================== :Author: Robert Kern :Status: Draft :Type: Standards Track :Created: 2018-05-24 Abstract -------- For the past decade, NumPy has had a strict backwards compatibility policy for the number stream of all of its random number distributions. Unlike other numerical components in ``numpy``, which are usually allowed to return different when results when they are modified if they remain correct, we have obligated the random number distributions to always produce the exact same numbers in every version. The objective of our stream-compatibility guarantee was to provide exact reproducibility for simulations across numpy versions in order to promote reproducible research. However, this policy has made it very difficult to enhance any of the distributions with faster or more accurate algorithms. After a decade of experience and improvements in the surrounding ecosystem of scientific software, we believe that there are now better ways to achieve these objectives. We propose relaxing our strict stream-compatibility policy to remove the obstacles that are in the way of accepting contributions to our random number generation capabilities. The Status Quo -------------- Our current policy, in full: A fixed seed and a fixed series of calls to ``RandomState`` methods using the same parameters will always produce the same results up to roundoff error except when the values were incorrect. Incorrect values will be fixed and the NumPy version in which the fix was made will be noted in the relevant docstring. Extension of existing parameter ranges and the addition of new parameters is allowed as long the previous behavior remains unchanged. This policy was first instated in Nov 2008 (in essence; the full set of weasel words grew over time) in response to a user wanting to be sure that the simulations that formed the basis of their scientific publication could be reproduced years later, exactly, with whatever version of ``numpy`` that was current at the time. We were keen to support reproducible research, and it was still early in the life of ``numpy.random``. We had not seen much cause to change the distribution methods all that much. We also had not thought very thoroughly about the limits of what we really could promise (and by ?we? in this section, we really mean Robert Kern, let?s be honest). Despite all of the weasel words, our policy overpromises compatibility. The same version of ``numpy`` built on different platforms, or just in a different way could cause changes in the stream, with varying degrees of rarity. The biggest is that the ``.multivariate_normal()`` method relies on ``numpy.linalg`` functions. Even on the same platform, if one links ``numpy`` with a different LAPACK, ``.multivariate_normal()`` may well return completely different results. More rarely, building on a different OS or CPU can cause differences in the stream. 
We use C ``long`` integers internally for integer distribution (it seemed like a good idea at the time), and those can vary in size depending on the platform. Distribution methods can overflow their internal C ``longs`` at different breakpoints depending on the platform and cause all of the random variate draws that follow to be different. And even if all of that is controlled, our policy still does not provide exact guarantees across versions. We still do apply bug fixes when correctness is at stake. And even if we didn?t do that, any nontrivial program does more than just draw random numbers. They do computations on those numbers, transform those with numerical algorithms from the rest of ``numpy``, which is not subject to so strict a policy. Trying to maintain stream-compatibility for our random number distributions does not help reproducible research for these reasons. The standard practice now for bit-for-bit reproducible research is to pin all of the versions of code of your software stack, possibly down to the OS itself. The landscape for accomplishing this is much easier today than it was in 2008. We now have ``pip``. We now have virtual machines. Those who need to reproduce simulations exactly now can (and ought to) do so by using the exact same version of ``numpy``. We do not need to maintain stream-compatibility across ``numpy`` versions to help them. Our stream-compatibility guarantee has hindered our ability to make improvements to ``numpy.random``. Several first-time contributors have submitted PRs to improve the distributions, usually by implementing a faster, or more accurate algorithm than the one that is currently there. Unfortunately, most of them would have required breaking the stream to do so. Blocked by our policy, and our inability to work around that policy, many of those contributors simply walked away. Implementation -------------- Work on a proposed new PRNG subsystem is already underway in the randomgen_ project. The specifics of the new design are out of scope for this NEP and up for much discussion, but we will discuss general policies that will guide the evolution of whatever code is adopted. We will also outline just a few of the requirements that such a new system must have to support the policy proposed in this NEP. First, we will maintain API source compatibility just as we do with the rest of ``numpy``. If we *must* make a breaking change, we will only do so with an appropriate deprecation period and warnings. Second, breaking stream-compatibility in order to introduce new features or improve performance will be *allowed* with *caution*. Such changes will be considered features, and as such will be no faster than the standard release cadence of features (i.e. on ``X.Y`` releases, never ``X.Y.Z``). Slowness will not be considered a bug for this purpose. Correctness bug fixes that break stream-compatibility can happen on bugfix releases, per usual, but developers should consider if they can wait until the next feature release. We encourage developers to strongly weight user?s pain from the break in stream-compatibility against the improvements. One example of a worthwhile improvement would be to change algorithms for a significant increase in performance, for example, moving from the `Box-Muller transform `_ method of Gaussian variate generation to the faster `Ziggurat algorithm `_. An example of a discouraged improvement would be tweaking the Ziggurat tables just a little bit for a small performance improvement. 
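For reference, a minimal NumPy sketch of the Box-Muller transform mentioned above (an illustration only; it is not part of this proposal)::

    import numpy as np

    def box_muller(u1, u2):
        # Map two independent U(0, 1) draws to two independent N(0, 1) draws.
        r = np.sqrt(-2.0 * np.log(u1))
        theta = 2.0 * np.pi * u2
        return r * np.cos(theta), r * np.sin(theta)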
Any new design for the RNG subsystem will provide a choice of different core uniform PRNG algorithms. A promising design choice is to make these core uniform PRNGs their own lightweight objects with a minimal set of methods (randomgen_ calls them ?basic RNGs?). The broader set of non-uniform distributions will be its own class that holds a reference to one of these core uniform PRNG objects and simply delegates to the core uniform PRNG object when it needs uniform random numbers. To borrow an example from randomgen_, the class ``MT19937`` is a basic RNG that implements the classic Mersenne Twister algorithm. The class ``RandomGenerator`` wraps around the basic RNG to provide all of the non-uniform distribution methods:: # This is not the only way to instantiate this object. # This is just handy for demonstrating the delegation. >>> brng = MT19937(seed) >>> rg = RandomGenerator(brng) >>> x = rg.standard_normal(10) We will be more strict about a select subset of methods on these basic RNG objects. They MUST guarantee stream-compatibility for a specified set of methods which are chosen to make it easier to compose them to build other distributions and which are needed to abstract over the implementation details of the variety of core PRNG algorithms. Namely, * ``.bytes()`` * ``.random_uintegers()`` * ``.random_sample()`` The distributions class (``RandomGenerator``) SHOULD have all of the same distribution methods as ``RandomState`` with close-enough function signatures such that almost all code that currently works with ``RandomState`` instances will work with ``RandomGenerator`` instances (ignoring the precise stream values). Some variance will be allowed for integer distributions: in order to avoid some of the cross-platform problems described above, these SHOULD be rewritten to work with ``uint64`` numbers on all platforms. .. _randomgen: https://github.com/bashtage/randomgen Supporting Unit Tests ::::::::::::::::::::: Because we did make a strong stream-compatibility guarantee early in numpy?s life, reliance on stream-compatibility has grown beyond reproducible simulations. One use case that remains for stream-compatibility across numpy versions is to use pseudorandom streams to generate test data in unit tests. With care, many of the cross-platform instabilities can be avoided in the context of small unit tests. The new PRNG subsystem MUST provide a second, legacy distributions class that uses the same implementations of the distribution methods as the current version of ``numpy.random.RandomState``. The methods of this class will keep the same strict stream-compatibility guarantees. It is intended that this class will no longer be modified, except to keep it working when numpy internals change. All new development should go into the primary distributions class. The purpose of ``RandomState`` will be documented as providing certain fixed functionality for backwards compatibility and stable numbers for the limited purpose of unit testing, and not making whole programs reproducible across numpy versions. This legacy distributions class MUST be accessible under the name ``numpy.random.RandomState`` for backwards compatibility. All current ways of instantiating ``numpy.random.RandomState`` with a given state should instantiate the Mersenne Twister basic RNG with the same state. The legacy distributions class MUST be capable of accepting other basic RNGs. 
The purpose here is to ensure that one can write a program with a consistent basic RNG state with a mixture of libraries that may or may not have upgraded from ``RandomState``. Instances of the legacy distributions class MUST respond ``True`` to ``isinstance(rg, numpy.random.RandomState)`` because there is current utility code that relies on that check. Similarly, old pickles of ``numpy.random.RandomState`` instances MUST unpickle correctly. ``numpy.random.*`` :::::::::::::::::: The preferred best practice for getting reproducible pseudorandom numbers is to instantiate a generator object with a seed and pass it around. The implicit global ``RandomState`` behind the ``numpy.random.*`` convenience functions can cause problems, especially when threads or other forms of concurrency are involved. Global state is always problematic. We categorically recommend avoiding using the convenience functions when reproducibility is involved. That said, people do use them and use ``numpy.random.seed()`` to control the state underneath them. It can be hard to categorize and count API usages consistently and usefully, but a very common usage is in unit tests where many of the problems of global state are less likely. The initial release of the new PRNG subsystem MUST leave these convenience functions as aliases to the methods on a global ``RandomState`` that is initialized with a Mersenne Twister basic RNG object. A call to ``numpy.random.seed()`` will be forwarded to that basic RNG object. In order to allow certain workarounds, it MUST be possible to replace the basic RNG underneath the global ``RandomState`` with any other basic RNG object (we leave the precise API details up to the new subsystem). Calling ``numpy.random.seed()`` thereafter SHOULD just pass the given seed to the current basic RNG object and not attempt to reset the basic RNG to the Mersenne Twister. The global ``RandomState`` instance MUST be accessible by the name ``numpy.random.mtrand._rand``: Robert Kern long ago promised ``scikit-learn`` that this name would be stable. Whoops. The set of ``numpy.random.*`` convenience functions SHALL remain the same as they currently are. They SHALL be aliases to the ``RandomState`` methods and not the new less-stable distributions class (``RandomGenerator``, in the examples above). Users who want to get the fastest, best distributions can follow best practices and instantiate generator objects explicitly. After we have experience with the new PRNG subsystem, we can and should revisit these issues in future NEPs. Alternatives ------------ Versioning :::::::::: For a long time, we considered that the way to allow algorithmic improvements while maintaining the stream was to apply some form of versioning. That is, every time we make a stream change in one of the distributions, we increment some version number somewhere. ``numpy.random`` would keep all past versions of the code, and there would be a way to get the old versions. We will not be doing this. If one needs to get the exact bit-for-bit results from a given version of ``numpy``, whether one uses random numbers or not, one should use the exact version of ``numpy``. Proposals of how to do RNG versioning varied widely, and we will not exhaustively list them here. We spent years going back and forth on these designs and were not able to find one that sufficed. Let that time lost, and more importantly, the contributors that we lost while we dithered, serve as evidence against the notion. 
Concretely, adding in versioning makes maintenance of ``numpy.random`` difficult. Necessarily, we would be keeping lots of versions of the same code around. Adding a new algorithm safely would still be quite hard. But most importantly, versioning is fundamentally difficult to *use* correctly. We want to make it easy and straightforward to get the latest, fastest, best versions of the distribution algorithms; otherwise, what's the point? The way to make that easy is to make the latest the default. But the default will necessarily change from release to release, so the user?s code would need to be altered anyway to specify the specific version that one wants to replicate. Adding in versioning to maintain stream-compatibility would still only provide the same level of stream-compatibility that we currently do, with all of the limitations described earlier. Given that the standard practice for such needs is to pin the release of ``numpy`` as a whole, versioning ``RandomState`` alone is superfluous. ``StableRandom`` :::::::::::::::: A previous version of this NEP proposed to leave ``RandomState`` completely alone for a deprecation period and build the new subsystem alongside with new names. To satisfy the unit testing use case, it proposed introducing a small distributions class nominally called ``StableRandom``. It would have provided a small subset of distribution methods that were considered most useful in unit testing, but not the full set such that it would be too likely to be used outside of the testing context. During discussion about this proposal, it became apparent that there was no satisfactory subset. At least some projects used a fairly broad selection of the ``RandomState`` methods in unit tests. Downstream project owners would have been forced to modify their code to accomodate the new PRNG subsystem. Some modifications might be simply mechanical, but the bulk of the work would have been tedious churn for no positive improvement to the downstream project, just avoiding being broken. Furthermore, under this old proposal, we would have had a quite lengthy deprecation period where ``RandomState`` existed alongside the new system of basic RNGs and distribution classes. Leaving the implementation of ``RandomState`` fixed meant that it could not use the new basic RNG state objects. Developing programs that use a mixture of libraries that have and have not upgraded would require managing two sets of PRNG states. This would notionally have been time-limited, but we intended the deprecation to be very long. The current proposal solves all of these problems. All current usages of ``RandomState`` will continue to work in perpetuity, though some may be discouraged through documentation. Unit tests can continue to use the full complement of ``RandomState`` methods. Mixed ``RandomState/RandomGenerator`` code can safely share the common basic RNG state. Unmodified ``RandomState`` code can make use of the new features of alternative basic RNGs like settable streams. Discussion ---------- - `NEP discussion < https://mail.python.org/pipermail/numpy-discussion/2018-June/078126.html>`_ - `Earlier discussion < https://mail.python.org/pipermail/numpy-discussion/2018-January/077608.html >`_ Copyright --------- This document has been placed in the public domain. -- Robert Kern -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From ralf.gommers at gmail.com Sat Jun 16 14:01:15 2018 From: ralf.gommers at gmail.com (Ralf Gommers) Date: Sat, 16 Jun 2018 11:01:15 -0700 Subject: [Numpy-discussion] NEP: Random Number Generator Policy In-Reply-To: References: Message-ID: On Sat, Jun 16, 2018 at 12:38 AM, Robert Kern wrote: > I have incorporated the feedback from this thread, and have significantly > altered the proposal. I think this version will be more palatable to > everyone. > > https://github.com/numpy/numpy/pull/11356 > https://github.com/rkern/numpy/blob/nep/rng-clarification/doc/neps/nep- > 0019-rng-policy.rst > > I'm pretty sure that Kevin Sheppard's prototype already implements the > broad strokes of my proposal (seriously, he thinks of everything; I'm just > playing catch up), so I don't think there is any technical risk. I think > it's just a matter of the fine details of shoving this into numpy.random > per se rather than a third party package. > > https://bashtage.github.io/randomgen/devel/legacy.html > > --- > > ============================== > Random Number Generator Policy > ============================== > > :Author: Robert Kern > :Status: Draft > :Type: Standards Track > :Created: 2018-05-24 > Thanks Robert. The whole proposal looks good to me now, just one minor comment below. > > The initial release of the new PRNG subsystem MUST leave these convenience > functions as aliases to the methods on a global ``RandomState`` that is > initialized with a Mersenne Twister basic RNG object. A call to > ``numpy.random.seed()`` will be forwarded to that basic RNG object. In > order > to allow certain workarounds, it MUST be possible to replace the basic RNG > underneath the global ``RandomState`` with any other basic RNG object (we > leave > the precise API details up to the new subsystem). Calling > ``numpy.random.seed()`` > thereafter SHOULD just pass the given seed to the current basic RNG object > and > not attempt to reset the basic RNG to the Mersenne Twister. The global > ``RandomState`` instance MUST be accessible by the name > ``numpy.random.mtrand._rand``: Robert Kern long ago promised > ``scikit-learn`` > that this name would be stable. Whoops. > This is a little weird; "mtrand" is an implementation detail already. There's exactly 3 instances of that in scikit-learn, so replacing those with a sane name (with a long timeline, say 4 numpy versions at least plus a major version number bump) doesn't seem unreasonable. Cheers, Ralf -------------- next part -------------- An HTML attachment was scrubbed... URL: From robert.kern at gmail.com Sat Jun 16 17:58:47 2018 From: robert.kern at gmail.com (Robert Kern) Date: Sat, 16 Jun 2018 14:58:47 -0700 Subject: [Numpy-discussion] NEP: Random Number Generator Policy In-Reply-To: References: Message-ID: On Sat, Jun 16, 2018 at 11:02 AM Ralf Gommers wrote: > > > On Sat, Jun 16, 2018 at 12:38 AM, Robert Kern > wrote: > >> I have incorporated the feedback from this thread, and have significantly >> altered the proposal. I think this version will be more palatable to >> everyone. >> >> https://github.com/numpy/numpy/pull/11356 >> >> https://github.com/rkern/numpy/blob/nep/rng-clarification/doc/neps/nep-0019-rng-policy.rst >> >> I'm pretty sure that Kevin Sheppard's prototype already implements the >> broad strokes of my proposal (seriously, he thinks of everything; I'm just >> playing catch up), so I don't think there is any technical risk. 
I think >> it's just a matter of the fine details of shoving this into numpy.random >> per se rather than a third party package. >> >> https://bashtage.github.io/randomgen/devel/legacy.html >> >> --- >> >> ============================== >> Random Number Generator Policy >> ============================== >> >> :Author: Robert Kern >> :Status: Draft >> :Type: Standards Track >> :Created: 2018-05-24 >> > > Thanks Robert. The whole proposal looks good to me now, just one minor > comment below. > > >> >> The initial release of the new PRNG subsystem MUST leave these convenience >> functions as aliases to the methods on a global ``RandomState`` that is >> initialized with a Mersenne Twister basic RNG object. A call to >> ``numpy.random.seed()`` will be forwarded to that basic RNG object. In >> order >> to allow certain workarounds, it MUST be possible to replace the basic RNG >> underneath the global ``RandomState`` with any other basic RNG object (we >> leave >> the precise API details up to the new subsystem). Calling >> ``numpy.random.seed()`` >> thereafter SHOULD just pass the given seed to the current basic RNG >> object and >> not attempt to reset the basic RNG to the Mersenne Twister. The global >> ``RandomState`` instance MUST be accessible by the name >> ``numpy.random.mtrand._rand``: Robert Kern long ago promised >> ``scikit-learn`` >> that this name would be stable. Whoops. >> > > This is a little weird; "mtrand" is an implementation detail already. > There's exactly 3 instances of that in scikit-learn, so replacing those > with a sane name (with a long timeline, say 4 numpy versions at least plus > a major version number bump) doesn't seem unreasonable. > Everything in this paragraph is explicitly just about the initial release with the new subsystem. A following paragraph says that we should revisit all of these in following releases. -- Robert Kern -------------- next part -------------- An HTML attachment was scrubbed... URL: From shoyer at gmail.com Sat Jun 16 23:55:12 2018 From: shoyer at gmail.com (Stephan Hoyer) Date: Sat, 16 Jun 2018 20:55:12 -0700 Subject: [Numpy-discussion] NEP: Random Number Generator Policy In-Reply-To: References: Message-ID: > > This is a little weird; "mtrand" is an implementation detail already. >> There's exactly 3 instances of that in scikit-learn, so replacing those >> with a sane name (with a long timeline, say 4 numpy versions at least plus >> a major version number bump) doesn't seem unreasonable. >> > > Everything in this paragraph is explicitly just about the initial release > with the new subsystem. A following paragraph says that we should revisit > all of these in following releases. > This already read a little strangely to me -- it sounded like an indefinite pronouncement. It would be good to clarify :). Otherwise, I am quite happy with this NEP! It avoids unnecessary churn, and opens the door to much needed improvements in numpy.random. -------------- next part -------------- An HTML attachment was scrubbed... URL: From robert.kern at gmail.com Sun Jun 17 00:34:01 2018 From: robert.kern at gmail.com (Robert Kern) Date: Sat, 16 Jun 2018 21:34:01 -0700 Subject: [Numpy-discussion] NEP: Random Number Generator Policy In-Reply-To: References: Message-ID: On Sat, Jun 16, 2018 at 8:56 PM Stephan Hoyer wrote: > This is a little weird; "mtrand" is an implementation detail already. 
>>> There's exactly 3 instances of that in scikit-learn, so replacing those >>> with a sane name (with a long timeline, say 4 numpy versions at least plus >>> a major version number bump) doesn't seem unreasonable. >>> >> >> Everything in this paragraph is explicitly just about the initial release >> with the new subsystem. A following paragraph says that we should revisit >> all of these in following releases. >> > > This already read a little strangely to me -- it sounded like an > indefinite pronouncement. It would be good to clarify :). > Fair enough. How does this language strike you? https://github.com/numpy/numpy/pull/11356/commits/15af58f7b1358d430a1af3c12f34a5024735d072 -- Robert Kern -------------- next part -------------- An HTML attachment was scrubbed... URL: From teoliphant at gmail.com Sun Jun 17 02:51:03 2018 From: teoliphant at gmail.com (Travis Oliphant) Date: Sun, 17 Jun 2018 01:51:03 -0500 Subject: [Numpy-discussion] A little about XND Message-ID: Hi everyone, I'm glad I'm able to contribute back to this discussion thread. I wanted to post a quick message to this group to make sure there is no mis-information about XND which has finally reached the point where it can be experimented with (http://xnd.io) and commented on. XND came out of thoughts and conversations we had at Continuum (now Anaconda) when thinking about cross-language array computing and how to enable improved features for high-level users in many languages (including Python, R, Ruby, Node, Scala, Rust, Go, etc.). Technically there are three projects that make up XND (thus the name Plures for the Github organization). All of these projects have a C-library and then a high-level interface (right now we only have resources to develop the Python interface but would love to see support for other languages). xnd (libxnd) is the typed container. ndtypes (libndtypes) is the (datashape-like) type system with a grammar, parser, and type matcher. gumath (libgumath) are generalized ufuncs which represent the entire function system on xnd. We will be talking more about XND in the coming months and years, but for the purposes of this list, I wanted to make it clear that 1) XND is not trying to replace NumPy. XND is a low-level library and intended to be such. It would be most welcome if someday NumPy uses XND. We understand this may be a while and certainly not before NumPy 2.0 or 3.0. 2) Our initial target users are Numba, pandas, Dask, xarray, and other higher-level objects at the moment. We are eagerly searching for integration opportunities to connect more developers (or advanced users) to xnd before making more progress. 3) We do discuss array-like things in the public channels. NumPy users and developers are welcome in those channels. Everything is done in public including the weekly meeting which anyone can attend: Weekly meeting: meet.google.com/heo-fmow-omz Live discussions: https://gitter.im/Plures/xnd for the libraries themselves https://gitter.im/Plures/xnd-ml for integrations. Issues and PRs: https://github.com/plures --- under the various projects. 4) We are thinking about adding a custom-dtype to NumPy that uses xnd and would be happy for anyone's help on that project. 5) We are in the early stages of exploring a high-level array interface (using the ideas of MoA and the Psi Calculus with Lenore Mullen who worked on APL years ago). Likely the first place this will see some initial progress is in an ND Sparse array that uses XND. We welcome participation and input from all. 
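To give a feel for how the pieces fit together, here is a rough sketch of constructing a typed container from Python (based on the examples at http://xnd.io; the exact behaviour of the `xnd` and `ndt` constructors shown here is an assumption, so please check the docs):

    # Rough sketch based on the xnd.io examples; APIs may differ in detail.
    from xnd import xnd          # typed container (libxnd)
    from ndtypes import ndt      # datashape-like type system (libndtypes)

    x = xnd([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
    print(x.type)                # inferred type, e.g. 2 * 3 * float64
    t = ndt("2 * 3 * float64")   # the same type written explicitly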
Stefan Krah has written the majority of the code and so we tend to respect his point of view. Pearu Peterson (of f2py and SciPy fame) has made some useful contributions recently. Stefan and I have been talking roughly weekly for a couple of years and so some of the problems currently there, I am certainly responsible for. Two of our immediate goals are to work with the Numba team to get support for ndtypes in Numba and allow Numba to use libgumath in no-python mode. I look forward to continuing the conversation with any of you who want to participate. Perhaps some of us can meet up during NumPy sprints to discuss more. XND is also currently looking for funding and time from interested parties to continue its development. -Travis -------------- next part -------------- An HTML attachment was scrubbed... URL: From m.h.vankerkwijk at gmail.com Sun Jun 17 20:47:02 2018 From: m.h.vankerkwijk at gmail.com (Marten van Kerkwijk) Date: Sun, 17 Jun 2018 20:47:02 -0400 Subject: [Numpy-discussion] A little about XND In-Reply-To: References: Message-ID: Hi Travis, More of a detailed question, but as we are currently thinking about extending the signature of gufuncs (i.e., things like `(m,n),(n,p)->(m,p)` for matrix multiplication), and as you must have thought about this for libgufunc, could you point me to how one would document the signature in your new system? (I briefly tried but there's no docs yet and I couldn't immediately find it in the code). If it is at all similar to numpy's and you have extended it, we should at least check whether we can do the same thing. Thanks, all best wishes, Marten -------------- next part -------------- An HTML attachment was scrubbed... URL: From teoliphant at gmail.com Mon Jun 18 01:32:00 2018 From: teoliphant at gmail.com (Travis Oliphant) Date: Mon, 18 Jun 2018 00:32:00 -0500 Subject: [Numpy-discussion] A little about XND In-Reply-To: References: Message-ID: On Sun, Jun 17, 2018, 7:48 PM Marten van Kerkwijk wrote: > Hi Travis, > > More of a detailed question, but as we are currently thinking about > extending the signature of gufuncs (i.e., things like `(m,n),(n,p)->(m,p)` > for matrix multiplication), and as you must have thought about this for > libgufunc, could you point me to how one would document the signature in > your new system? (I briefly tried but there's no docs yet and I couldn't > immediately find it in the code). If it is at all similar to numpy's and > you have extended it, we should at least check whether we can do the same > thing. > I have been reading with interest these gufunc proposals and have pointed it out to the gumath devs. Right now, gumath doesn't go much beyond NumPy's syntax except for use of a more extensible type system. It uses the same notion of the dimension signature, though with a syntax derived from datashape which you can read more about here: http://datashape.readthedocs.io/en/latest/ Stefan Krah, Pearu, or Saul may have more comments. Thanks, -Travis > Thanks, all best wishes, > > Marten > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at python.org > https://mail.python.org/mailman/listinfo/numpy-discussion > -------------- next part -------------- An HTML attachment was scrubbed... URL: From m.h.vankerkwijk at gmail.com Mon Jun 18 09:58:38 2018 From: m.h.vankerkwijk at gmail.com (Marten van Kerkwijk) Date: Mon, 18 Jun 2018 09:58:38 -0400 Subject: [Numpy-discussion] A little about XND In-Reply-To: References: Message-ID: Interesting. 
If nothing else, it would be a nice way to mark our internal functions, including the loops. It also should not be difficult to have (g)ufunc signatures exported in that way, combining `signature` and `types`. In more detail, I see the grammar clearly allows fixed dimensions in a way that easily translates, but it isn't immediately obvious to me how one would express broadcasting or possibly missing ones, so perhaps there is room for sharing how to indicate that (although it is at a higher level; the function signature is fine). -- Marten For others, direct link to datashape grammar: http://datashape.readthedocs.io/en/latest/grammar.html -------------- next part -------------- An HTML attachment was scrubbed... URL: From skrah at bytereef.org Mon Jun 18 10:02:18 2018 From: skrah at bytereef.org (Stefan Krah) Date: Mon, 18 Jun 2018 16:02:18 +0200 Subject: [Numpy-discussion] A little about XND In-Reply-To: References: Message-ID: <20180618140218.GA17701@bytereef.org> On Sun, Jun 17, 2018 at 08:47:02PM -0400, Marten van Kerkwijk wrote: > More of a detailed question, but as we are currently thinking about > extending the signature of gufuncs (i.e., things like `(m,n),(n,p)->(m,p)` > for matrix multiplication), and as you must have thought about this for > libgufunc, could you point me to how one would document the signature in > your new system? (I briefly tried but there's no docs yet and I couldn't > immediately find it in the code). The docs are a bit scattered across the three libraries, here is something about types and pattern matching: http://ndtypes.readthedocs.io/en/latest/ndtypes/types.html http://ndtypes.readthedocs.io/en/latest/ndtypes/pattern-matching.html A couple of example signatures: https://github.com/plures/gumath/blob/5f1f6de3d2c9a003b9dfb224fe09c63ae81bf18b/libgumath/extending/quaternion.c#L121 https://github.com/plures/gumath/blob/5f1f6de3d2c9a003b9dfb224fe09c63ae81bf18b/libgumath/extending/pdist.c#L115 The function signature for float64-specialized matrix multiplication is: "... * N * M * float64, ... * M * P * float64 -> ... * N * P * float64" The function signature for generic matrix multiplication is: "... * N * M * T, ... * M * P * T -> ... * N * P * T" A function that only accepts scalars: "... * N * M * Scalar, ... * M * P * Scalar -> ... * N * P * Scalar" A couple of observations: Functions are multimethods, so function dispatch on concrete arguments works by trying to locate a matching kernel. For example, if only the above "float64" kernel is present, all other dtypes will fail. Casting ------- It is still under debate how we handle casting. The current examples libgumath/kernels simply generate *all* signatures that allow exact casting of the input for a specific function. This is feasible for unary and binary kernels, but could lead to case explosion for functions with many arguments. The kernel writer however is always free to use the above type variable or Scalar signatures and handle casting inside the kernel. Explicit gufuncs ---------------- Gufuncs are explicit and require leading ellipses. A signature of "N * M * float64" is not a gufunc and does not allow outer dimensions. Disable broadcasting -------------------- "D... * N * M * float64, D... * M * P * float64 -> D... * N * P * float64" Dimension variables match a sequence of dimensions, so in the above example all outer dimensions must be exactly the same. Non-symbolic matches -------------------- "... * 2 * 3 * int8" only accepts "2 * 3 * int8" as the inner dimensions. 
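As a quick syntax check of the signatures above (a sketch; that the Python `ndt` constructor accepts full function types like this is an assumption based on the grammar):

    from ndtypes import ndt

    # Parse the float64-specialized matrix multiplication kernel signature given above.
    sig = ndt("... * N * M * float64, ... * M * P * float64 -> ... * N * P * float64")
    print(sig)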
Sorry for the long mail, I hope this clears up a bit what function signatures generally look like. Stefan Krah From m.h.vankerkwijk at gmail.com Mon Jun 18 12:34:03 2018 From: m.h.vankerkwijk at gmail.com (Marten van Kerkwijk) Date: Mon, 18 Jun 2018 12:34:03 -0400 Subject: [Numpy-discussion] A little about XND In-Reply-To: <20180618140218.GA17701@bytereef.org> References: <20180618140218.GA17701@bytereef.org> Message-ID: Hi Stefan, That looks quite nice and expressive. In the context of a discussion we have been having about describing `matmul/@` and possibly broadcastable dimensions, I think from your description it sounds like one would describe `@` with multiple functions (the multiple dispatch we have been (are?) considering as well): "... * N * M * T, ... * M * P * T -> ... * N * P * T" "M * T, ... * M * P * T -> ... P * T" "... * N * M * T, M * T -> ... * N * T" "M * T, M * T -> T" Is there a way to describe broadcasting? The sample case we've come up with is a function that calculates a weighted mean. This might take (values, sigmas) and return (mean, sigma_mean), which would imply a signature like: "... N * T, ... N * T -> ... * T, ... * T" But would your signature allow indicating that one could pass in a single sigma? I.e., broadcast the second 1 to N if needed? I realize that this is no longer about describing precisely what the function doing the calculation expects, but rather what an upper level is allowed to do before calling the function (i.e., take a dimension of 1 and broadcast it). All the best, Marten -------------- next part -------------- An HTML attachment was scrubbed... URL: From skrah at bytereef.org Mon Jun 18 15:09:50 2018 From: skrah at bytereef.org (Stefan Krah) Date: Mon, 18 Jun 2018 21:09:50 +0200 Subject: [Numpy-discussion] A little about XND In-Reply-To: References: <20180618140218.GA17701@bytereef.org> Message-ID: <20180618190950.GA3899@bytereef.org> Hi Marten, On Mon, Jun 18, 2018 at 12:34:03PM -0400, Marten van Kerkwijk wrote: > That looks quite nice and expressive. In the context of a discussion we > have been having about describing `matmul/@` and possibly broadcastable > dimensions, I think from your description it sounds like one would describe > `@` with multiple functions (the multiple dispatch we have been (are?) > considering as well): > > > "... * N * M * T, ... * M * P * T -> ... * N * P * T" > "M * T, ... * M * P * T -> ... P * T" > "... * N * M * T, M * T -> ... * N * T" > "M * T, M * T -> T" Yes, that's the way, and the outer dimensions (the part matched by the ellipsis) are always broadcast like in NumPy. > Is there a way to describe broadcasting? The sample case we've come up > with is a function that calculates a weighted mean. This might take > (values, sigmas) and return (mean, sigma_mean), which would imply a > signature like: > > "... N * T, ... N * T -> ... * T, ... * T" > > But would your signature allow indicating that one could pass in a single > sigma? I.e., broadcast the second 1 to N if needed? Actually I came across this today when implementing optimized matching for binary functions. I wanted the faster kernel "... * N * int64, ... * N * int64 -> ... * N * int64" to also match e.g. the input "int64, 10 * int64". The generic datashape spec would forbid this, but perhaps the '?' that you propose in nep-0020 would offer a way out of this for ndtypes. It's a bit confusing for datashape, since there is already a questionmark for missing variable dimensions (that have shape==0 in the data). 
>>> ndt("var * ?var * int64") ndt("var * ?var * int64") This would be the type for e.g. [[0], None, [1,2,3]]. But for symbolic dimensions (which only match fixed dimensions) perhaps this "... * ?N * int64, ... * ?N * int64 -> ... * ?N * int64" or, as in the NEP, "... * N? * int64, ... * N? * int64 -> ... * N? * int64" should mean "At least one input has ndim >= 1, broadcast as necessary". This still means that for the "all ndim==0" case one would need an additional kernel "int64, int64 -> int64". > I realize that this is no longer about describing precisely what the > function doing the calculation expects, but rather what an upper level is > allowed to do before calling the function (i.e., take a dimension of 1 and > broadcast it). Yes, for datashape the problem is that it also allows non-broadcastable signatures like "N * float64", really the same as "double x[]" in C. But the '?' with occasionally one additional kernel for ndim==0 could solve this. Stefan Krah From charlesr.harris at gmail.com Mon Jun 18 16:20:10 2018 From: charlesr.harris at gmail.com (Charles R Harris) Date: Mon, 18 Jun 2018 14:20:10 -0600 Subject: [Numpy-discussion] rackspace ssl certificates Message-ID: Hi All, I've been trying to put out the NumPy 1.15.0rc1, but cannot get `numpy-wheels` to upload the wheels to rackspace on windows, there is a certification problem. I note that that requirement was supposedly disabled: on_success: # Upload the generated wheel package to Rackspace # On Windows, Apache Libcloud cannot find a standard CA cert bundle so we # disable the ssl checks. and nothing relevant seems to have changed in our `.appveyor.yml` since the last successful run 7 days ago, 6 if we count 1.14.5, so I'm thinking a policy has changed at either at rackspace or appveyor, but that is just a guess. I'm experimenting with various changes to the script and the `apache-libcloud` version to see if I can get success, but thought I'd ask if anyone knew anything that might be helpful. Chuck -------------- next part -------------- An HTML attachment was scrubbed... URL: From nathan12343 at gmail.com Mon Jun 18 16:22:01 2018 From: nathan12343 at gmail.com (Nathan Goldbaum) Date: Mon, 18 Jun 2018 15:22:01 -0500 Subject: [Numpy-discussion] rackspace ssl certificates In-Reply-To: References: Message-ID: I think Matthew Brett needs to fix this. On Mon, Jun 18, 2018 at 3:20 PM Charles R Harris wrote: > Hi All, > > I've been trying to put out the NumPy 1.15.0rc1, but cannot get > `numpy-wheels` to upload the wheels to rackspace on windows, there is a > certification problem. I note that that requirement was supposedly disabled: > > on_success: > # Upload the generated wheel package to Rackspace > # On Windows, Apache Libcloud cannot find a standard CA cert bundle so we > # disable the ssl checks. > > and nothing relevant seems to have changed in our `.appveyor.yml` since > the last successful run 7 days ago, 6 if we count 1.14.5, so I'm thinking a > policy has changed at either at rackspace or appveyor, but that is just a > guess. I'm experimenting with various changes to the script and the > `apache-libcloud` version to see if I can get success, but thought I'd ask > if anyone knew anything that might be helpful. > > Chuck > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at python.org > https://mail.python.org/mailman/listinfo/numpy-discussion > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From charlesr.harris at gmail.com Mon Jun 18 16:42:31 2018 From: charlesr.harris at gmail.com (Charles R Harris) Date: Mon, 18 Jun 2018 14:42:31 -0600 Subject: [Numpy-discussion] rackspace ssl certificates In-Reply-To: References: Message-ID: On Mon, Jun 18, 2018 at 2:22 PM, Nathan Goldbaum wrote: > I think Matthew Brett needs to fix this. > That would be nice, but I'm not convinced it is helpful :) I note that latest `apache-libcloud` does not install directly on windows, there seem to be some missing dependencies. Chuck -------------- next part -------------- An HTML attachment was scrubbed... URL: From matthew.brett at gmail.com Mon Jun 18 17:13:21 2018 From: matthew.brett at gmail.com (Matthew Brett) Date: Mon, 18 Jun 2018 22:13:21 +0100 Subject: [Numpy-discussion] rackspace ssl certificates In-Reply-To: References: Message-ID: Hi, On Mon, Jun 18, 2018 at 9:42 PM, Charles R Harris wrote: > > > On Mon, Jun 18, 2018 at 2:22 PM, Nathan Goldbaum > wrote: >> >> I think Matthew Brett needs to fix this. > > > That would be nice, but I'm not convinced it is helpful :) I note that > latest `apache-libcloud` does not install directly on windows, there seem to > be some missing dependencies.> I'm happy to give it a go - Chuck - can I cancel the various builds running on my account, so I can do some debugging. Cheers, Matthew From charlesr.harris at gmail.com Mon Jun 18 19:24:41 2018 From: charlesr.harris at gmail.com (Charles R Harris) Date: Mon, 18 Jun 2018 17:24:41 -0600 Subject: [Numpy-discussion] rackspace ssl certificates In-Reply-To: References: Message-ID: On Mon, Jun 18, 2018 at 3:13 PM, Matthew Brett wrote: > Hi, > > On Mon, Jun 18, 2018 at 9:42 PM, Charles R Harris > wrote: > > > > > > On Mon, Jun 18, 2018 at 2:22 PM, Nathan Goldbaum > > wrote: > >> > >> I think Matthew Brett needs to fix this. > > > > > > That would be nice, but I'm not convinced it is helpful :) I note that > > latest `apache-libcloud` does not install directly on windows, there > seem to > > be some missing dependencies.> > > I'm happy to give it a go - Chuck - can I cancel the various builds > running on my account, so I can do some debugging. > Absolutely! Nuke those suckers ... Chuck > > Cheers, > > Matthew > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at python.org > https://mail.python.org/mailman/listinfo/numpy-discussion > -------------- next part -------------- An HTML attachment was scrubbed... URL: From matthew.brett at gmail.com Mon Jun 18 19:58:28 2018 From: matthew.brett at gmail.com (Matthew Brett) Date: Tue, 19 Jun 2018 00:58:28 +0100 Subject: [Numpy-discussion] rackspace ssl certificates In-Reply-To: References: Message-ID: On Tue, Jun 19, 2018 at 12:24 AM, Charles R Harris wrote: > > > On Mon, Jun 18, 2018 at 3:13 PM, Matthew Brett > wrote: >> >> Hi, >> >> On Mon, Jun 18, 2018 at 9:42 PM, Charles R Harris >> wrote: >> > >> > >> > On Mon, Jun 18, 2018 at 2:22 PM, Nathan Goldbaum >> > wrote: >> >> >> >> I think Matthew Brett needs to fix this. >> > >> > >> > That would be nice, but I'm not convinced it is helpful :) I note that >> > latest `apache-libcloud` does not install directly on windows, there >> > seem to >> > be some missing dependencies.> >> >> I'm happy to give it a go - Chuck - can I cancel the various builds >> running on my account, so I can do some debugging. > > > Absolutely! Nuke those suckers ... Hmm - I just tried installing certifi to get the SSL certificates, and removed --no-ssl-check. 
I wonder if something changed in the Rackspace protocols, or something. In case it's useful, I'm using a little repo that runs an Appveyor job then drops into an RDP server for me to log into, with the relevant bit here: https://github.com/matthew-brett/appvfutz/blob/master/appveyor.yml#L24 See: https://www.gep13.co.uk/blog/how-to-use-appveyor-remote-desktop-connection That said, maybe the fix doesn't work, let's wait on the builds. Cheers, Matthew From m.h.vankerkwijk at gmail.com Mon Jun 18 21:04:19 2018 From: m.h.vankerkwijk at gmail.com (Marten van Kerkwijk) Date: Mon, 18 Jun 2018 21:04:19 -0400 Subject: [Numpy-discussion] A little about XND In-Reply-To: <20180618190950.GA3899@bytereef.org> References: <20180618140218.GA17701@bytereef.org> <20180618190950.GA3899@bytereef.org> Message-ID: Hi Stefan, Just to clarify: the ? we propose in the NEP is really for matmul - it indicates a true missing dimension (i.e., the array cannot have outer broadcast dimensions as well). For inner loop broadcasting, I'm proposing a "|1" post-fix, which means a dimension could also be missing, but can also be there and be 1, in which case it can do outer broadcast as well. So, for your function in your notation, it might look like: "... * N|1 * int64, ... * N|1 * int64 -> ... * N * int64" (Note that the output of course always has N - if both inputs have 1 then N=1; it is not meant to be absent). I think that actually looks quite clear, although perhaps one might want parentheses around it (since "|" = "or" normally does not have precedence over "*" = multiply), i.e., "... * (N|1) * int64, ... * (N|1) * int64 -> ... * N * int64" All the best, Marten -------------- next part -------------- An HTML attachment was scrubbed... URL: From charlesr.harris at gmail.com Mon Jun 18 21:44:16 2018 From: charlesr.harris at gmail.com (Charles R Harris) Date: Mon, 18 Jun 2018 19:44:16 -0600 Subject: [Numpy-discussion] rackspace ssl certificates In-Reply-To: References: Message-ID: On Mon, Jun 18, 2018 at 5:58 PM, Matthew Brett wrote: > On Tue, Jun 19, 2018 at 12:24 AM, Charles R Harris > wrote: > > > > > > On Mon, Jun 18, 2018 at 3:13 PM, Matthew Brett > > wrote: > >> > >> Hi, > >> > >> On Mon, Jun 18, 2018 at 9:42 PM, Charles R Harris > >> wrote: > >> > > >> > > >> > On Mon, Jun 18, 2018 at 2:22 PM, Nathan Goldbaum < > nathan12343 at gmail.com> > >> > wrote: > >> >> > >> >> I think Matthew Brett needs to fix this. > >> > > >> > > >> > That would be nice, but I'm not convinced it is helpful :) I note that > >> > latest `apache-libcloud` does not install directly on windows, there > >> > seem to > >> > be some missing dependencies.> > >> > >> I'm happy to give it a go - Chuck - can I cancel the various builds > >> running on my account, so I can do some debugging. > > > > > > Absolutely! Nuke those suckers ... > > Hmm - I just tried installing certifi to get the SSL certificates, and > removed --no-ssl-check. I wonder if something changed in the > Rackspace protocols, or something. > > In case it's useful, I'm using a little repo that runs an Appveyor job > then drops into an RDP server for me to log into, with the relevant > bit here: > > https://github.com/matthew-brett/appvfutz/blob/master/appveyor.yml#L24 > > See: https://www.gep13.co.uk/blog/how-to-use-appveyor-remote- > desktop-connection > > That said, maybe the fix doesn't work, let's wait on the builds. > > Looks like that fixes the problem. Probably scipy-wheels will need that fix also. Do you know if new wheels with the same name will overwrite the old ones? 
ISTR that that is the case. BTW, there don't seem to be any nightly builds, does something need reconfiguration? Chuck -------------- next part -------------- An HTML attachment was scrubbed... URL: From matthew.brett at gmail.com Tue Jun 19 06:57:05 2018 From: matthew.brett at gmail.com (Matthew Brett) Date: Tue, 19 Jun 2018 11:57:05 +0100 Subject: [Numpy-discussion] rackspace ssl certificates In-Reply-To: References: Message-ID: Hi, On Tue, Jun 19, 2018 at 2:44 AM, Charles R Harris wrote: > > > On Mon, Jun 18, 2018 at 5:58 PM, Matthew Brett > wrote: >> >> On Tue, Jun 19, 2018 at 12:24 AM, Charles R Harris >> wrote: >> > >> > >> > On Mon, Jun 18, 2018 at 3:13 PM, Matthew Brett >> > wrote: >> >> >> >> Hi, >> >> >> >> On Mon, Jun 18, 2018 at 9:42 PM, Charles R Harris >> >> wrote: >> >> > >> >> > >> >> > On Mon, Jun 18, 2018 at 2:22 PM, Nathan Goldbaum >> >> > >> >> > wrote: >> >> >> >> >> >> I think Matthew Brett needs to fix this. >> >> > >> >> > >> >> > That would be nice, but I'm not convinced it is helpful :) I note >> >> > that >> >> > latest `apache-libcloud` does not install directly on windows, there >> >> > seem to >> >> > be some missing dependencies.> >> >> >> >> I'm happy to give it a go - Chuck - can I cancel the various builds >> >> running on my account, so I can do some debugging. >> > >> > >> > Absolutely! Nuke those suckers ... >> >> Hmm - I just tried installing certifi to get the SSL certificates, and >> removed --no-ssl-check. I wonder if something changed in the >> Rackspace protocols, or something. >> >> In case it's useful, I'm using a little repo that runs an Appveyor job >> then drops into an RDP server for me to log into, with the relevant >> bit here: >> >> https://github.com/matthew-brett/appvfutz/blob/master/appveyor.yml#L24 >> >> See: >> https://www.gep13.co.uk/blog/how-to-use-appveyor-remote-desktop-connection >> >> That said, maybe the fix doesn't work, let's wait on the builds. >> > > Looks like that fixes the problem. Probably scipy-wheels will need that fix > also. I put it in. > Do you know if new wheels with the same name will overwrite the old > ones? ISTR that that is the case. Right - they overwrite the old ones. > BTW, there don't seem to be any nightly builds, does something need > reconfiguration? For Appveyor? You need a cron-enabled account. My account is enabled, I just emailed the appveyor support with my username, and an explanation. Maybe worth doing the same for the numpy account? Thereafter, you can just enter the cron time string in the settings, to enable daily builds. Cheers, Matthew From cimrman3 at ntc.zcu.cz Tue Jun 19 07:52:31 2018 From: cimrman3 at ntc.zcu.cz (Robert Cimrman) Date: Tue, 19 Jun 2018 13:52:31 +0200 Subject: [Numpy-discussion] ANN: SfePy 2018.2 Message-ID: <31b88a31-b853-4a00-415a-55f935472ab3@ntc.zcu.cz> I am pleased to announce release 2018.2 of SfePy. Description ----------- SfePy (simple finite elements in Python) is a software for solving systems of coupled partial differential equations by the finite element method or by the isogeometric analysis (limited support). It is distributed under the new BSD license. 
Home page: http://sfepy.org Mailing list: https://mail.python.org/mm3/mailman3/lists/sfepy.python.org/ Git (source) repository, issue tracker: https://github.com/sfepy/sfepy Highlights of this release -------------------------- - generalized-alpha and velocity Verlet elastodynamics solvers - terms for dispersion in fluids - caching of reference coordinates for faster repeated use of probes - new wrapper of MUMPS linear solver for parallel runs For full release notes see http://docs.sfepy.org/doc/release_notes.html#id1 (rather long and technical). Cheers, Robert Cimrman --- Contributors to this release in alphabetical order: Robert Cimrman Lubos Kejzlar Vladimir Lukes Matyas Novak From charlesr.harris at gmail.com Tue Jun 19 09:46:08 2018 From: charlesr.harris at gmail.com (Charles R Harris) Date: Tue, 19 Jun 2018 07:46:08 -0600 Subject: [Numpy-discussion] rackspace ssl certificates In-Reply-To: References: Message-ID: On Tue, Jun 19, 2018 at 4:57 AM, Matthew Brett wrote: > Hi, > > On Tue, Jun 19, 2018 at 2:44 AM, Charles R Harris > wrote: > > > > > > On Mon, Jun 18, 2018 at 5:58 PM, Matthew Brett > > wrote: > >> > >> On Tue, Jun 19, 2018 at 12:24 AM, Charles R Harris > >> wrote: > >> > > >> > > >> > On Mon, Jun 18, 2018 at 3:13 PM, Matthew Brett < > matthew.brett at gmail.com> > >> > wrote: > >> >> > >> >> Hi, > >> >> > >> >> On Mon, Jun 18, 2018 at 9:42 PM, Charles R Harris > >> >> wrote: > >> >> > > >> >> > > >> >> > On Mon, Jun 18, 2018 at 2:22 PM, Nathan Goldbaum > >> >> > > >> >> > wrote: > >> >> >> > >> >> >> I think Matthew Brett needs to fix this. > >> >> > > >> >> > > >> >> > That would be nice, but I'm not convinced it is helpful :) I note > >> >> > that > >> >> > latest `apache-libcloud` does not install directly on windows, > there > >> >> > seem to > >> >> > be some missing dependencies.> > >> >> > >> >> I'm happy to give it a go - Chuck - can I cancel the various builds > >> >> running on my account, so I can do some debugging. > >> > > >> > > >> > Absolutely! Nuke those suckers ... > >> > >> Hmm - I just tried installing certifi to get the SSL certificates, and > >> removed --no-ssl-check. I wonder if something changed in the > >> Rackspace protocols, or something. > >> > >> In case it's useful, I'm using a little repo that runs an Appveyor job > >> then drops into an RDP server for me to log into, with the relevant > >> bit here: > >> > >> https://github.com/matthew-brett/appvfutz/blob/master/appveyor.yml#L24 > >> > >> See: > >> https://www.gep13.co.uk/blog/how-to-use-appveyor-remote- > desktop-connection > >> > >> That said, maybe the fix doesn't work, let's wait on the builds. > >> > > > > Looks like that fixes the problem. Probably scipy-wheels will need that > fix > > also. > > I put it in. > > > Do you know if new wheels with the same name will overwrite the old > > ones? ISTR that that is the case. > > Right - they overwrite the old ones. > > > BTW, there don't seem to be any nightly builds, does something need > > reconfiguration? > > For Appveyor? You need a cron-enabled account. My account is > enabled, I just emailed the appveyor support with my username, and an > explanation. Maybe worth doing the same for the numpy account? > Thereafter, you can just enter the cron time string in the settings, > to enable daily builds. > > What I was curious about is that there were no more "daily" builds of master. Chuck -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From matthew.brett at gmail.com Tue Jun 19 12:36:29 2018 From: matthew.brett at gmail.com (Matthew Brett) Date: Tue, 19 Jun 2018 17:36:29 +0100 Subject: [Numpy-discussion] rackspace ssl certificates In-Reply-To: References: Message-ID: On Tue, Jun 19, 2018 at 2:46 PM, Charles R Harris wrote: > > > On Tue, Jun 19, 2018 at 4:57 AM, Matthew Brett > wrote: >> >> Hi, >> >> On Tue, Jun 19, 2018 at 2:44 AM, Charles R Harris >> wrote: >> > >> > >> > On Mon, Jun 18, 2018 at 5:58 PM, Matthew Brett >> > wrote: >> >> >> >> On Tue, Jun 19, 2018 at 12:24 AM, Charles R Harris >> >> wrote: >> >> > >> >> > >> >> > On Mon, Jun 18, 2018 at 3:13 PM, Matthew Brett >> >> > >> >> > wrote: >> >> >> >> >> >> Hi, >> >> >> >> >> >> On Mon, Jun 18, 2018 at 9:42 PM, Charles R Harris >> >> >> wrote: >> >> >> > >> >> >> > >> >> >> > On Mon, Jun 18, 2018 at 2:22 PM, Nathan Goldbaum >> >> >> > >> >> >> > wrote: >> >> >> >> >> >> >> >> I think Matthew Brett needs to fix this. >> >> >> > >> >> >> > >> >> >> > That would be nice, but I'm not convinced it is helpful :) I note >> >> >> > that >> >> >> > latest `apache-libcloud` does not install directly on windows, >> >> >> > there >> >> >> > seem to >> >> >> > be some missing dependencies.> >> >> >> >> >> >> I'm happy to give it a go - Chuck - can I cancel the various builds >> >> >> running on my account, so I can do some debugging. >> >> > >> >> > >> >> > Absolutely! Nuke those suckers ... >> >> >> >> Hmm - I just tried installing certifi to get the SSL certificates, and >> >> removed --no-ssl-check. I wonder if something changed in the >> >> Rackspace protocols, or something. >> >> >> >> In case it's useful, I'm using a little repo that runs an Appveyor job >> >> then drops into an RDP server for me to log into, with the relevant >> >> bit here: >> >> >> >> https://github.com/matthew-brett/appvfutz/blob/master/appveyor.yml#L24 >> >> >> >> See: >> >> >> >> https://www.gep13.co.uk/blog/how-to-use-appveyor-remote-desktop-connection >> >> >> >> That said, maybe the fix doesn't work, let's wait on the builds. >> >> >> > >> > Looks like that fixes the problem. Probably scipy-wheels will need that >> > fix >> > also. >> >> I put it in. >> >> > Do you know if new wheels with the same name will overwrite the old >> > ones? ISTR that that is the case. >> >> Right - they overwrite the old ones. >> >> > BTW, there don't seem to be any nightly builds, does something need >> > reconfiguration? >> >> For Appveyor? You need a cron-enabled account. My account is >> enabled, I just emailed the appveyor support with my username, and an >> explanation. Maybe worth doing the same for the numpy account? >> Thereafter, you can just enter the cron time string in the settings, >> to enable daily builds. >> > > What I was curious about is that there were no more "daily" builds of > master. Is that right? That there were daily builds of master, on Appveyor? I don't know how those worked, I only recently got cron permission ... 
Cheers, Matthew From charlesr.harris at gmail.com Tue Jun 19 12:58:03 2018 From: charlesr.harris at gmail.com (Charles R Harris) Date: Tue, 19 Jun 2018 10:58:03 -0600 Subject: [Numpy-discussion] rackspace ssl certificates In-Reply-To: References: Message-ID: On Tue, Jun 19, 2018 at 10:36 AM, Matthew Brett wrote: > On Tue, Jun 19, 2018 at 2:46 PM, Charles R Harris > wrote: > > > > > > On Tue, Jun 19, 2018 at 4:57 AM, Matthew Brett > > wrote: > >> > >> Hi, > >> > >> On Tue, Jun 19, 2018 at 2:44 AM, Charles R Harris > >> wrote: > >> > > >> > > >> > On Mon, Jun 18, 2018 at 5:58 PM, Matthew Brett < > matthew.brett at gmail.com> > >> > wrote: > >> >> > >> >> On Tue, Jun 19, 2018 at 12:24 AM, Charles R Harris > >> >> wrote: > >> >> > > >> >> > > >> >> > On Mon, Jun 18, 2018 at 3:13 PM, Matthew Brett > >> >> > > >> >> > wrote: > >> >> >> > >> >> >> Hi, > >> >> >> > >> >> >> On Mon, Jun 18, 2018 at 9:42 PM, Charles R Harris > >> >> >> wrote: > >> >> >> > > >> >> >> > > >> >> >> > On Mon, Jun 18, 2018 at 2:22 PM, Nathan Goldbaum > >> >> >> > > >> >> >> > wrote: > >> >> >> >> > >> >> >> >> I think Matthew Brett needs to fix this. > >> >> >> > > >> >> >> > > >> >> >> > That would be nice, but I'm not convinced it is helpful :) I > note > >> >> >> > that > >> >> >> > latest `apache-libcloud` does not install directly on windows, > >> >> >> > there > >> >> >> > seem to > >> >> >> > be some missing dependencies.> > >> >> >> > >> >> >> I'm happy to give it a go - Chuck - can I cancel the various > builds > >> >> >> running on my account, so I can do some debugging. > >> >> > > >> >> > > >> >> > Absolutely! Nuke those suckers ... > >> >> > >> >> Hmm - I just tried installing certifi to get the SSL certificates, > and > >> >> removed --no-ssl-check. I wonder if something changed in the > >> >> Rackspace protocols, or something. > >> >> > >> >> In case it's useful, I'm using a little repo that runs an Appveyor > job > >> >> then drops into an RDP server for me to log into, with the relevant > >> >> bit here: > >> >> > >> >> https://github.com/matthew-brett/appvfutz/blob/master/ > appveyor.yml#L24 > >> >> > >> >> See: > >> >> > >> >> https://www.gep13.co.uk/blog/how-to-use-appveyor-remote- > desktop-connection > >> >> > >> >> That said, maybe the fix doesn't work, let's wait on the builds. > >> >> > >> > > >> > Looks like that fixes the problem. Probably scipy-wheels will need > that > >> > fix > >> > also. > >> > >> I put it in. > >> > >> > Do you know if new wheels with the same name will overwrite the old > >> > ones? ISTR that that is the case. > >> > >> Right - they overwrite the old ones. > >> > >> > BTW, there don't seem to be any nightly builds, does something need > >> > reconfiguration? > >> > >> For Appveyor? You need a cron-enabled account. My account is > >> enabled, I just emailed the appveyor support with my username, and an > >> explanation. Maybe worth doing the same for the numpy account? > >> Thereafter, you can just enter the cron time string in the settings, > >> to enable daily builds. > >> > > > > What I was curious about is that there were no more "daily" builds of > > master. > > Is that right? That there were daily builds of master, on Appveyor? > I don't know how those worked, I only recently got cron permission ... > No, but there used to be daily builds on travis. They stopped 8 days ago, https://travis-ci.org/MacPython/numpy-wheels/builds. Chuck -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From matti.picus at gmail.com Tue Jun 19 13:27:39 2018 From: matti.picus at gmail.com (Matti Picus) Date: Tue, 19 Jun 2018 10:27:39 -0700 Subject: [Numpy-discussion] rackspace ssl certificates In-Reply-To: References: Message-ID: An HTML attachment was scrubbed... URL: From matthew.brett at gmail.com Tue Jun 19 13:57:31 2018 From: matthew.brett at gmail.com (Matthew Brett) Date: Tue, 19 Jun 2018 18:57:31 +0100 Subject: [Numpy-discussion] rackspace ssl certificates In-Reply-To: References: Message-ID: Hi, On Tue, Jun 19, 2018 at 6:27 PM, Matti Picus wrote: > On 19/06/18 09:58, Charles R Harris wrote: >> >> > What I was curious about is that there were no more "daily" builds of >> > master. >> >> Is that right? That there were daily builds of master, on Appveyor? >> I don't know how those worked, I only recently got cron permission ... > > > No, but there used to be daily builds on travis. They stopped 8 days ago, > https://travis-ci.org/MacPython/numpy-wheels/builds. Oops - yes - sorry - I retired the 'daily' branch, in favor of 'master', but forgot to update the Travis-CI settings. Done now. Cheers, Matthew From sidky at uchicago.edu Tue Jun 19 13:51:12 2018 From: sidky at uchicago.edu (Emil Sidky) Date: Tue, 19 Jun 2018 12:51:12 -0500 Subject: [Numpy-discussion] question about array slicing and element assignment Message-ID: <88fa73eb-4ef9-c0f8-09f1-32b967b79b15@uchicago.edu> Hello, The following is an example where an array element assignment didn't work as I expected. Create a 6 x 3 matrix: In [70]: a = randn(6,3) In [71]: a Out[71]: array([[ 1.73266816, 0.948849 , 0.69188222], [-0.61840161, -0.03449826, 0.15032552], [ 0.4963306 , 0.77028209, -0.63076396], [-1.92273602, -1.03146536, 0.27744612], [ 0.70736325, 1.54687964, -0.75573888], [ 0.16316043, -0.34814532, 0.3683143 ]]) Create a 3x3 boolean array: In [72]: mask = randn(3,3)>0. In [73]: mask Out[73]: array([[ True, True, True], [False, True, True], [ True, False, True]], dtype=bool) Try to modify elements of "a" with the following line: In [74]: a[(2,3,5),][mask] = 1. No elements are changed in "a": In [75]: a Out[75]: array([[ 1.73266816, 0.948849 , 0.69188222], [-0.61840161, -0.03449826, 0.15032552], [ 0.4963306 , 0.77028209, -0.63076396], [-1.92273602, -1.03146536, 0.27744612], [ 0.70736325, 1.54687964, -0.75573888], [ 0.16316043, -0.34814532, 0.3683143 ]]) Instead try to modify elements of "a" with this line: In [76]: a[::2,][mask] = 1. This time it works: In [77]: a Out[77]: array([[ 1. , 1. , 1. ], [-0.61840161, -0.03449826, 0.15032552], [ 0.4963306 , 1. , 1. ], [-1.92273602, -1.03146536, 0.27744612], [ 1. , 1.54687964, 1. ], [ 0.16316043, -0.34814532, 0.3683143 ]]) Is there a way where I can modify the elements of "a" selected by an expression like "a[(2,3,5),][mask]" ? Thanks , Emil From shoyer at gmail.com Tue Jun 19 17:09:24 2018 From: shoyer at gmail.com (Stephan Hoyer) Date: Tue, 19 Jun 2018 14:09:24 -0700 Subject: [Numpy-discussion] question about array slicing and element assignment In-Reply-To: <88fa73eb-4ef9-c0f8-09f1-32b967b79b15@uchicago.edu> References: <88fa73eb-4ef9-c0f8-09f1-32b967b79b15@uchicago.edu> Message-ID: You will need to convert "a[(2,3,5),][mask]" into a single indexing expression, e.g., by using utility functions like np.nonzero() on mask. NumPy can't support assignment in chained indexing.
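For example, a rough sketch with the arrays below: since mask selects within the sub-array a[[2, 3, 5], :], the row/column positions of its True entries can be mapped back onto "a" and used in one vectorized assignment:

rows = np.array([2, 3, 5])
i, j = np.nonzero(mask)    # positions of the True entries of mask
a[rows[i], j] = 1.         # a single indexing operation, so the assignment sticks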
On Tue, Jun 19, 2018 at 1:25 PM Emil Sidky wrote: > Hello, > The following is an example where an array element assignment didn't work > as I expected. > Create a 6 x 3 matrix: > > In [70]: a = randn(6,3) > > In [71]: a > Out[71]: > array([[ 1.73266816, 0.948849 , 0.69188222], > [-0.61840161, -0.03449826, 0.15032552], > [ 0.4963306 , 0.77028209, -0.63076396], > [-1.92273602, -1.03146536, 0.27744612], > [ 0.70736325, 1.54687964, -0.75573888], > [ 0.16316043, -0.34814532, 0.3683143 ]]) > > Create a 3x3 boolean array: > In [72]: mask = randn(3,3)>0. > > In [73]: mask > Out[73]: > array([[ True, True, True], > [False, True, True], > [ True, False, True]], dtype=bool) > > Try to modify elements of "a" with the following line: > In [74]: a[(2,3,5),][mask] = 1. > No elements are changed in "a": > In [75]: a > Out[75]: > array([[ 1.73266816, 0.948849 , 0.69188222], > [-0.61840161, -0.03449826, 0.15032552], > [ 0.4963306 , 0.77028209, -0.63076396], > [-1.92273602, -1.03146536, 0.27744612], > [ 0.70736325, 1.54687964, -0.75573888], > [ 0.16316043, -0.34814532, 0.3683143 ]]) > > Instead try to modify elements of "a" with this line: > In [76]: a[::2,][mask] = 1. > > This time it works: > In [77]: a > Out[77]: > array([[ 1. , 1. , 1. ], > [-0.61840161, -0.03449826, 0.15032552], > [ 0.4963306 , 1. , 1. ], > [-1.92273602, -1.03146536, 0.27744612], > [ 1. , 1.54687964, 1. ], > [ 0.16316043, -0.34814532, 0.3683143 ]]) > > > Is there a way where I can modify the elements of "a" selected by an > expression like "a[(2,3,5),][mask]" ? > > Thanks , Emil > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at python.org > https://mail.python.org/mailman/listinfo/numpy-discussion > -------------- next part -------------- An HTML attachment was scrubbed... URL: From diagonaldevice at gmail.com Tue Jun 19 19:37:21 2018 From: diagonaldevice at gmail.com (Michael Lamparski) Date: Tue, 19 Jun 2018 19:37:21 -0400 Subject: [Numpy-discussion] Forcing new dimensions to appear at front in advanced indexing Message-ID: Hi all, So, in advanced indexing, numpy decides where to put new axes based on whether the "advanced indices" are all next to each other. >>> np.random.random((3,4,5,6,7,8))[:, [[0,0],[0,0]], 1, :].shape (3, 2, 2, 6, 7, 8) >>> np.random.random((3,4,5,6,7,8))[:, [[0,0],[0,0]], :, 1].shape (2, 2, 3, 5, 7, 8) In creating a wrapper type around arrays, I'm finding myself needing to suppress this behavior, so that the new axes consistently appear in the front. I thought of a dumb hat trick: def index(x, indices): return x[(True, None) + indices] Which certainly gets the new dimensions where I want them, but it introduces a ghost dimension of 1 (and sometimes two such dimensions!) in a place where I'm not sure I can easily find it. >>> np.random.random((3,4,5,6,7,8))[True, None, 1].shape (1, 1, 4, 5, 6, 7, 8) >>> np.random.random((3,4,5,6,7,8))[True, None, :, [[0,0],[0,0]], 1, :].shape (2, 2, 1, 3, 6, 7, 8) >>> np.random.random((3,4,5,6,7,8))[True, None, :, [[0,0],[0,0]], :, 1].shape (2, 2, 1, 3, 5, 7, 8) any better ideas? --- Michael -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From sebastian at sipsolutions.net Wed Jun 20 05:34:42 2018 From: sebastian at sipsolutions.net (Sebastian Berg) Date: Wed, 20 Jun 2018 11:34:42 +0200 Subject: [Numpy-discussion] Forcing new dimensions to appear at front in advanced indexing In-Reply-To: References: Message-ID: On Tue, 2018-06-19 at 19:37 -0400, Michael Lamparski wrote: > Hi all, > > So, in advanced indexing, numpy decides where to put new axes based > on whether the "advanced indices" are all next to each other. > > >>> np.random.random((3,4,5,6,7,8))[:, [[0,0],[0,0]], 1, :].shape > (3, 2, 2, 6, 7, 8) > >>> np.random.random((3,4,5,6,7,8))[:, [[0,0],[0,0]], :, 1].shape > (2, 2, 3, 5, 7, 8) > > In creating a wrapper type around arrays, I'm finding myself needing > to suppress this behavior, so that the new axes consistently appear > in the front. I thought of a dumb hat trick: > > def index(x, indices): > return x[(True, None) + indices] > > Which certainly gets the new dimensions where I want them, but it > introduces a ghost dimension of 1 (and sometimes two such > dimensions!) in a place where I'm not sure I can easily find it. > > >>> np.random.random((3,4,5,6,7,8))[True, None, 1].shape > (1, 1, 4, 5, 6, 7, 8) > >>> np.random.random((3,4,5,6,7,8))[True, None, :, [[0,0],[0,0]], 1, > :].shape > (2, 2, 1, 3, 6, 7, 8) > >>> np.random.random((3,4,5,6,7,8))[True, None, :, [[0,0],[0,0]], :, > 1].shape > (2, 2, 1, 3, 5, 7, 8) > > any better ideas? > We have proposed `arr.vindex[...]` to do this and there are is a pure python implementation of it out there, I think it may be linked here somewhere: https://github.com/numpy/numpy/pull/6256 There is a way that will generally work using triple indexing: arr[..., None, None][orig_indx * (slice(None), np.array(0))][..., 0] The first and last indexing operation is just a view creation, so it is basically a no-op. Now doing this gives me the shiver, but it will work always. If you want to have a no-copy behaviour in case your original index is ont an advanced indexing operation, you should replace the np.array(0) with just 0. - Sebastian > --- > > Michael > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at python.org > https://mail.python.org/mailman/listinfo/numpy-discussion -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 833 bytes Desc: This is a digitally signed message part URL: From diagonaldevice at gmail.com Wed Jun 20 09:15:27 2018 From: diagonaldevice at gmail.com (Michael Lamparski) Date: Wed, 20 Jun 2018 09:15:27 -0400 Subject: [Numpy-discussion] Forcing new dimensions to appear at front in advanced indexing In-Reply-To: References: Message-ID: > There is a way that will generally work using triple indexing: > > arr[..., None, None][orig_indx + (slice(None), np.array(0))][..., 0] Impressive! (note: I fixed the * typo in the quote) > The first and last indexing operation is just a view creation, so it is > basically a no-op. Now doing this gives me the shiver, but it will work > always. If you want to have a no-copy behaviour in case your original > index is ont an advanced indexing operation, you should replace the > np.array(0) with just 0. I agree about the shivers, but any workaround is good to have nonetheless. If the index is not an advanced indexing operation, does it not suffice to simply apply the index tuple as-is? Michael -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From sebastian at sipsolutions.net Wed Jun 20 09:30:49 2018 From: sebastian at sipsolutions.net (Sebastian Berg) Date: Wed, 20 Jun 2018 15:30:49 +0200 Subject: [Numpy-discussion] Forcing new dimensions to appear at front in advanced indexing In-Reply-To: References: Message-ID: On Wed, 2018-06-20 at 09:15 -0400, Michael Lamparski wrote: > > There is a way that will generally work using triple indexing: > > > > arr[..., None, None][orig_indx + (slice(None), np.array(0))][..., > 0] > > Impressive! (note: I fixed the * typo in the quote) > > > The first and last indexing operation is just a view creation, so > it is > > basically a no-op. Now doing this gives me the shiver, but it will > work > > always. If you want to have a no-copy behaviour in case your > original > > index is ont an advanced indexing operation, you should replace the > > np.array(0) with just 0. > > I agree about the shivers, but any workaround is good to have > nonetheless. > > If the index is not an advanced indexing operation, does it not > suffice to simply apply the index tuple as-is? Yes, with the `np.array(0)` however, the result will forced to be a copy and not a view into the original array, when writing the line first I thought of "force advanced indexing", which there is likely no reason for though. If you replace it with 0, the result will be an identical view when the index is not advanced (with only a tiny bit of call overhead). So it might be nice to just use 0 instead, since if your index is advanced indexing, there is no difference between the two. But then you do not have to check if there is advanced indexing going on at all. Btw. if you want to use it for an object, I might suggest to actually use: object.vindex[...] notation for this logic (requires a slightly annoying helper class). The NEP is basically just a draft/proposal status, but xarray is already using that indexing method/property IIRC, so that name is relatively certain by now. I frankly am not sure right now if the vindex proposal was with a forced copy or not, probably it was. - Sebastian > > Michael > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at python.org > https://mail.python.org/mailman/listinfo/numpy-discussion -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 833 bytes Desc: This is a digitally signed message part URL: From matti.picus at gmail.com Thu Jun 21 12:25:31 2018 From: matti.picus at gmail.com (Matti Picus) Date: Thu, 21 Jun 2018 09:25:31 -0700 Subject: [Numpy-discussion] Remove sctypeNA and typeNA from numpy core Message-ID: <04fb0382-9a42-f4e8-bf32-baf215df09c6@gmail.com> numpy.core has many ways to catalogue dtype names: sctypeDict, typeDict (which is precisely sctypeDict), typecodes, and typename. We also generate sctypeNA and typeNA but, as issue 11241 shows, it is sometimes wrong. They are also not documented and never used inside numpy. Instead of fixing it, I propose to remove sctypeNA and typeNA. Any thoughts or objections? 
Matti From matti.picus at gmail.com Thu Jun 21 13:31:50 2018 From: matti.picus at gmail.com (Matti Picus) Date: Thu, 21 Jun 2018 10:31:50 -0700 Subject: [Numpy-discussion] Remove sctypeNA and typeNA from numpy core In-Reply-To: <04fb0382-9a42-f4e8-bf32-baf215df09c6@gmail.com> References: <04fb0382-9a42-f4e8-bf32-baf215df09c6@gmail.com> Message-ID: <10d9d518-5075-8b9c-2383-792a1589a9e0@gmail.com> On 21/06/18 09:25, Matti Picus wrote: > numpy.core has many ways to catalogue dtype names: sctypeDict, > typeDict (which is precisely sctypeDict), typecodes, and typename. We > also generate sctypeNA and typeNA but, as issue 11241 shows, it is > sometimes wrong. They are also not documented and never used inside > numpy. Instead of fixing it, I propose to remove sctypeNA and typeNA. > > Any thoughts or objections? > Matti Whoops? 11340 (not 11241) which has been merged. Matti From charlesr.harris at gmail.com Thu Jun 21 13:34:08 2018 From: charlesr.harris at gmail.com (Charles R Harris) Date: Thu, 21 Jun 2018 11:34:08 -0600 Subject: [Numpy-discussion] NumPy 1.15.0rc1 released Message-ID: Hi All, On behalf of the NumPy team I'm pleased to announce the release of NumPy 1.15.0rc1. This release has an unusual number of cleanups, many deprecations of old functions, and improvements to many existing functions. A total of 423 pull reguests were merged for this release, please look at the release notes for details. Some highlights are: - NumPy has switched to pytest for testing. - A new `numpy.printoptions` context manager. - Many improvements to the histogram functions. - Support for unicode field names in python 2.7. - Improved support for PyPy. The Python versions supported by this release are 2.7, 3.4-3.6. The wheels are linked with OpenBLAS 3.0, which should fix some of the linalg problems reported for NumPy 1.14, and the source archives were created using Cython 0.28.2 and should work with the upcoming Python 3.7. Wheels for this release can be downloaded from PyPI , source archives are available from Github . A total of 128 people contributed to this release. People with a "+" by their names contributed a patch for the first time. - Aaron Critchley + - Aarthi + - Aarthi Agurusa + - Alex Thomas + - Alexander Belopolsky - Allan Haldane - Anas Khan + - Andras Deak - Andrey Portnoy + - Anna Chiara - Aurelien Jarno + - Baurzhan Muftakhidinov - Berend Kapelle + - Bernhard M. Wiedemann - Bjoern Thiel + - Bob Eldering - Cenny Wenner + - Charles Harris - ChloeColeongco + - Chris Billington + - Christopher + - Chun-Wei Yuan + - Claudio Freire + - Daniel Smith - Darcy Meyer + - David Abdurachmanov + - David Freese - Deepak Kumar Gouda + - Dennis Weyland + - Derrick Williams + - Dmitriy Shalyga + - Eric Cousineau + - Eric Larson - Eric Wieser - Evgeni Burovski - Frederick Lefebvre + - Gaspar Karm + - Geoffrey Irving - Gerhard Hobler + - Gerrit Holl - Guo Ci + - Hameer Abbasi + - Han Shen - Hiroyuki V. Yamazaki + - Hong Xu - Ihor Melnyk + - Jaime Fernandez - Jake VanderPlas + - James Tocknell + - Jarrod Millman - Jeff VanOss + - John Kirkham - Jonas Rauber + - Jonathan March + - Joseph Fox-Rabinovitz - Julian Taylor - Junjie Bai + - Juris Bogusevs + - J?rg D?pfert - Kenichi Maehashi + - Kevin Sheppard - Kimikazu Kato + - Kirit Thadaka + - Kritika Jalan + - Lakshay Garg + - Lars G + - Licht Takeuchi - Louis Potok + - Luke Zoltan Kelley - MSeifert04 + - Mads R. B. Kristensen + - Malcolm Smith + - Mark Harfouche + - Marten H. 
van Kerkwijk + - Marten van Kerkwijk - Matheus Vieira Portela + - Mathieu Lamarre - Mathieu Sornay + - Matthew Brett - Matthew Rocklin + - Matthias Bussonnier - Matti Picus - Michael Droettboom - Miguel S?nchez de Le?n Peque + - Mike Toews + - Milo + - Nathaniel J. Smith - Nelle Varoquaux - Nicholas Nadeau + - Nick Minkyu Lee + - Nikita + - Nikita Kartashov + - Nils Becker + - Oleg Zabluda - Orestis Floros + - Pat Gunn + - Paul van Mulbregt + - Pauli Virtanen - Pierre Chanial + - Ralf Gommers - Raunak Shah + - Robert Kern - Russell Keith-Magee + - Ryan Soklaski + - Samuel Jackson + - Sebastian Berg - Siavash Eliasi + - Simon Conseil - Simon Gibbons - Stefan Krah + - Stefan van der Walt - Stephan Hoyer - Subhendu + - Subhendu Ranjan Mishra + - Tai-Lin Wu + - Tobias Fischer + - Toshiki Kataoka + - Tyler Reddy + - Varun Nayyar - Victor Rodriguez + - Warren Weckesser - Zane Bradley + - fo40225 - lumbric + - luzpaz + - mamrehn + - tynn + - xoviat Cheers Chuck -------------- next part -------------- An HTML attachment was scrubbed... URL: From sebastian at sipsolutions.net Thu Jun 21 14:07:04 2018 From: sebastian at sipsolutions.net (Sebastian Berg) Date: Thu, 21 Jun 2018 20:07:04 +0200 Subject: [Numpy-discussion] Remove sctypeNA and typeNA from numpy core In-Reply-To: <04fb0382-9a42-f4e8-bf32-baf215df09c6@gmail.com> References: <04fb0382-9a42-f4e8-bf32-baf215df09c6@gmail.com> Message-ID: <955d87a8233a004338d0fa571dae23cfbe4c44cc.camel@sipsolutions.net> On Thu, 2018-06-21 at 09:25 -0700, Matti Picus wrote: > numpy.core has many ways to catalogue dtype names: sctypeDict, > typeDict > (which is precisely sctypeDict), typecodes, and typename. We also > generate sctypeNA and typeNA but, as issue 11241 shows, it is > sometimes > wrong. They are also not documented and never used inside numpy. > Instead > of fixing it, I propose to remove sctypeNA and typeNA. > Sounds like a good idea, we have too much stuff in there, and this one is not even useful (I bet the NA is for the missing value support that never happened). Might be good to do a quick deprecation anyway though, mostly out of principle. - Sebastian > Any thoughts or objections? > Matti > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at python.org > https://mail.python.org/mailman/listinfo/numpy-discussion > -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 833 bytes Desc: This is a digitally signed message part URL: From wieser.eric+numpy at gmail.com Thu Jun 21 14:22:16 2018 From: wieser.eric+numpy at gmail.com (Eric Wieser) Date: Thu, 21 Jun 2018 11:22:16 -0700 Subject: [Numpy-discussion] Remove sctypeNA and typeNA from numpy core In-Reply-To: <955d87a8233a004338d0fa571dae23cfbe4c44cc.camel@sipsolutions.net> References: <04fb0382-9a42-f4e8-bf32-baf215df09c6@gmail.com> <955d87a8233a004338d0fa571dae23cfbe4c44cc.camel@sipsolutions.net> Message-ID: > I bet the NA is for the missing value support thatnever happened Nope - NA stands for NumArray Eric On Thu, 21 Jun 2018 at 11:07 Sebastian Berg wrote: > On Thu, 2018-06-21 at 09:25 -0700, Matti Picus wrote: > > numpy.core has many ways to catalogue dtype names: sctypeDict, > > typeDict > > (which is precisely sctypeDict), typecodes, and typename. We also > > generate sctypeNA and typeNA but, as issue 11241 shows, it is > > sometimes > > wrong. They are also not documented and never used inside numpy. 
> > Instead > > of fixing it, I propose to remove sctypeNA and typeNA. > > > > Sounds like a good idea, we have too much stuff in there, and this one > is not even useful (I bet the NA is for the missing value support that > never happened). > > Might be good to do a quick deprecation anyway though, mostly out of > principle. > > - Sebastian > > > Any thoughts or objections? > > Matti > > _______________________________________________ > > NumPy-Discussion mailing list > > NumPy-Discussion at python.org > > https://mail.python.org/mailman/listinfo/numpy-discussion > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at python.org > https://mail.python.org/mailman/listinfo/numpy-discussion > -------------- next part -------------- An HTML attachment was scrubbed... URL: From shoyer at gmail.com Mon Jun 25 17:30:02 2018 From: shoyer at gmail.com (Stephan Hoyer) Date: Mon, 25 Jun 2018 17:30:02 -0400 Subject: [Numpy-discussion] NEP 21: Simplified and explicit advanced indexing Message-ID: Sebastian and I have revised a Numpy Enhancement Proposal that he started three years ago for overhauling NumPy's advanced indexing. We'd now like to present it for official consideration. Minor inline comments (e.g., typos) can be added to the latest pull request (https://github.com/numpy/numpy/pull/11414/files), but otherwise let's keep discussion on the mailing list. The NumPy website should update shortly with a rendered version ( http://www.numpy.org/neps/nep-0021-advanced-indexing.html), but until then please see the full text below. Cheers, Stephan ========================================= Simplified and explicit advanced indexing ========================================= :Author: Sebastian Berg :Author: Stephan Hoyer :Status: Draft :Type: Standards Track :Created: 2015-08-27 Abstract -------- NumPy's "advanced" indexing support for indexing arrays with other arrays is one of its most powerful and popular features. Unfortunately, the existing rules for advanced indexing with multiple array indices are typically confusing to both new, and in many cases even old, users of NumPy. Here we propose an overhaul and simplification of advanced indexing, including two new "indexer" attributes ``oindex`` and ``vindex`` to facilitate explicit indexing. Background ---------- Existing indexing operations ~~~~~~~~~~~~~~~~~~~~~~~~~~~~ NumPy arrays currently support a flexible range of indexing operations: - "Basic" indexing involving only slices, integers, ``np.newaxis`` and ellipsis (``...``), e.g., ``x[0, :3, np.newaxis]`` for selecting the first element from the 0th axis, the first three elements from the 1st axis and inserting a new axis of size 1 at the end. Basic indexing always return a view of the indexed array's data. - "Advanced" indexing, also called "fancy" indexing, includes all cases where arrays are indexed by other arrays. Advanced indexing always makes a copy: - "Boolean" indexing by boolean arrays, e.g., ``x[x > 0]`` for selecting positive elements. - "Vectorized" indexing by one or more integer arrays, e.g., ``x[[0, 1]]`` for selecting the first two elements along the first axis. With multiple arrays, vectorized indexing uses broadcasting rules to combine indices along multiple dimensions. This allows for producing a result of arbitrary shape with arbitrary elements from the original arrays. - "Mixed" indexing involving any combinations of the other advancing types. This is no more powerful than vectorized indexing, but is sometimes more convenient. 
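A brief illustrative sketch of these three kinds of indexing on a small 2D array (the `Examples` section below covers the details and corner cases):

x = np.arange(9).reshape(3, 3)

x[0, :2]           # basic indexing: a view of the first two elements of row 0
x[x > 4]           # boolean indexing: the 1D copy array([5, 6, 7, 8])
x[[0, 2], [1, 2]]  # vectorized indexing: picks x[0, 1] and x[2, 2], i.e. array([1, 8])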
For clarity, we will refer to these existing rules as "legacy indexing". This is only a high-level summary; for more details, see NumPy's documentation and `Examples` below. Outer indexing ~~~~~~~~~~~~~~ One broadly useful class of indexing operations is not supported: - "Outer" or orthogonal indexing treats one-dimensional arrays equivalently to slices for determining output shapes. The rule for outer indexing is that the result should be equivalent to independently indexing along each dimension with integer or boolean arrays as if both the indexed and indexing arrays were one-dimensional. This form of indexing is familiar to many users of other programming languages such as MATLAB, Fortran and R. The reason why NumPy omits support for outer indexing is that the rules for outer and vectorized indexing conflict. Consider indexing a 2D array by two 1D integer arrays, e.g., ``x[[0, 1], [0, 1]]``: - Outer indexing is equivalent to combining multiple integer indices with ``itertools.product()``. The result in this case is another 2D array with all combinations of indexed elements, e.g., ``np.array([[x[0, 0], x[0, 1]], [x[1, 0], x[1, 1]]])`` - Vectorized indexing is equivalent to combining multiple integer indices with ``zip()``. The result in this case is a 1D array containing the diagonal elements, e.g., ``np.array([x[0, 0], x[1, 1]])``. This difference is a frequent stumbling block for new NumPy users. The outer indexing model is easier to understand, and is a natural generalization of slicing rules. But NumPy instead chose to support vectorized indexing, because it is strictly more powerful. It is always possible to emulate outer indexing by vectorized indexing with the right indices. To make this easier, NumPy includes utility objects and functions such as ``np.ogrid`` and ``np.ix_``, e.g., ``x[np.ix_([0, 1], [0, 1])]``. However, there are no utilities for emulating fully general/mixed outer indexing, which could unambiguously allow for slices, integers, and 1D boolean and integer arrays. Mixed indexing ~~~~~~~~~~~~~~ NumPy's existing rules for combining multiple types of indexing in the same operation are quite complex, involving a number of edge cases. One reason why mixed indexing is particularly confusing is that at first glance the result works deceptively like outer indexing. Returning to our example of a 2D array, both ``x[:2, [0, 1]]`` and ``x[[0, 1], :2]`` return 2D arrays with axes in the same order as the original array. However, as soon as two or more non-slice objects (including integers) are introduced, vectorized indexing rules apply. The axes introduced by the array indices are at the front, unless all array indices are consecutive, in which case NumPy deduces where the user "expects" them to be. Consider indexing a 3D array ``arr`` with shape ``(X, Y, Z)``: 1. ``arr[:, [0, 1], 0]`` has shape ``(X, 2)``. 2. ``arr[[0, 1], 0, :]`` has shape ``(2, Z)``. 3. ``arr[0, :, [0, 1]]`` has shape ``(2, Y)``, not ``(Y, 2)``! These first two cases are intuitive and consistent with outer indexing, but this last case is quite surprising, even to many highly experienced NumPy users. Mixed cases involving multiple array indices are also surprising, and only less problematic because the current behavior is so useless that it is rarely encountered in practice. When a boolean array index is mixed with another boolean or integer array, the boolean array is converted to integer array indices (equivalent to ``np.nonzero()``) and then broadcast.
For example, indexing a 2D array of size ``(2, 2)`` like ``x[[True, False], [True, False]]`` produces a 1D vector with shape ``(1,)``, not a 2D sub-matrix with shape ``(1, 1)``. Mixed indexing seems so tricky that it is tempting to say that it never should be used. However, it is not easy to avoid, because NumPy implicitly adds full slices if there are fewer indices than the full dimensionality of the indexed array. This means that indexing a 2D array like ``x[[0, 1]]`` is equivalent to ``x[[0, 1], :]``. These cases are not surprising, but they constrain the behavior of mixed indexing. Indexing in other Python array libraries ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Indexing is a useful and widely recognized mechanism for accessing multi-dimensional array data, so it is no surprise that many other libraries in the scientific Python ecosystem also support array indexing. Unfortunately, the full complexity of NumPy's indexing rules means that it is both challenging and undesirable for other libraries to copy its behavior in all of its nuance. The only full implementation of NumPy-style indexing is NumPy itself. This includes projects like dask.array and h5py, which support *most* types of array indexing in some form, and otherwise attempt to copy NumPy's API exactly. Vectorized indexing in particular can be challenging to implement with array storage backends not based on NumPy. In contrast, indexing by 1D arrays along at least one dimension in the style of outer indexing is much more achievable. This has led many libraries (including dask and h5py) to attempt to define a safe subset of NumPy-style indexing that is equivalent to outer indexing, e.g., by only allowing indexing with an array along at most one dimension. However, this is quite challenging to do correctly in a general enough way to be useful. For example, the current versions of dask and h5py both handle mixed indexing in case 3 above inconsistently with NumPy. This is quite likely to lead to bugs. These inconsistencies, in addition to the broader challenge of implementing every type of indexing logic, make it challenging to write high-level array libraries like xarray or dask.array that can interchangeably index many types of array storage. In contrast, explicit APIs for outer and vectorized indexing in NumPy would provide a model that external libraries could reliably emulate, even if they don't support every type of indexing. High level changes ------------------ Inspired by multiple "indexer" attributes for controlling different types of indexing behavior in pandas, we propose to: 1. Introduce ``arr.oindex[indices]`` which allows array indices, but uses outer indexing logic. 2. Introduce ``arr.vindex[indices]`` which uses the current "vectorized"/broadcasted logic but with two differences from legacy indexing: * Boolean indices are not supported. All indices must be integers, integer arrays or slices. * The integer index result dimensions are always the first axes of the result array. No transpose is done, even for a single integer array index. 3. Plain indexing on arrays will start to give warnings and eventually errors in cases where one of the explicit indexers should be preferred: * First, in all cases where legacy and outer indexing would give different results. * Later, potentially in all cases involving an integer array. These constraints are sufficient for making indexing generally consistent with expectations and providing a less surprising learning curve with ``oindex``.
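As a rough sketch of the intended semantics, emulated here with existing NumPy operations (``oindex`` and ``vindex`` do not exist yet; the attribute forms in the comments are only what this proposal would provide):

x = np.arange(12).reshape(3, 4)
rows, cols = [0, 2], [0, 3]

x[np.ix_(rows, cols)]  # what x.oindex[rows, cols] would return: all combinations, shape (2, 2)
x[rows, cols]          # what x.vindex[rows, cols] would return: paired elements, shape (2,)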
Note that all things mentioned here apply both for assignment as well as
subscription.

Understanding these details is *not* easy. The `Examples` section in the
discussion gives code examples.
And the hopefully easier `Motivational Example` provides some motivational
use-cases for the general ideas and is likely a good start for anyone not
intimately familiar with advanced indexing.


Detailed Description
--------------------

Proposed rules
~~~~~~~~~~~~~~

From the three problems noted above some expectations for NumPy can
be deduced:

1. There should be a prominent outer/orthogonal indexing method such as
   ``arr.oindex[indices]``.

2. Considering how confusing vectorized/fancy indexing can be, it should
   be possible to make it more explicit (e.g. ``arr.vindex[indices]``).

3. A new ``arr.vindex[indices]`` method would not be tied to the
   confusing transpose rules of fancy indexing, which is for example
   needed for the simple case of a single advanced index. Thus,
   no transposing should be done. The axes created by the integer array
   indices are always inserted at the front, even for a single index.

4. Boolean indexing is conceptually outer indexing. Broadcasting
   together with other advanced indices in the manner of legacy
   indexing is generally not helpful or well defined.
   A user who wants the "``nonzero``" plus broadcast behaviour can thus
   be expected to do this manually. Thus, ``vindex`` does not need to
   support boolean index arrays.

5. An ``arr.legacy_index`` attribute should be implemented to support
   legacy indexing. This gives a simple way to update existing codebases
   using legacy indexing, which will make the deprecation of plain indexing
   behavior easier. The longer name ``legacy_index`` is intentionally chosen
   to be explicit and discourage its use in new code.

6. Plain indexing ``arr[...]`` should return an error for ambiguous cases.
   To begin with, this probably means cases where ``arr[ind]`` and
   ``arr.oindex[ind]`` return different results give deprecation warnings.
   This includes every use of vectorized indexing with multiple integer arrays.
   Due to the transposing behaviour, this means that ``arr[0, :, index_arr]``
   will be deprecated, but ``arr[:, 0, index_arr]`` will not for the time being.

7. To ensure that existing subclasses of `ndarray` that override indexing
   do not inadvertently revert to default behavior for indexing attributes,
   these attributes should have explicit checks that disable them if
   ``__getitem__`` or ``__setitem__`` has been overridden.

Unlike plain indexing, the new indexing attributes are explicitly aimed
at higher dimensional indexing, so several additional changes should be
implemented:

* The indexing attributes will enforce exact dimension and indexing match.
  This means that no implicit ellipsis (``...``) will be added. Unless
  an ellipsis is present the indexing expression will thus only work for
  an array with a specific number of dimensions.
  This makes the expression more explicit and safeguards against wrong
  dimensionality of arrays.
  There should be no implications for "duck typing" compatibility with
  builtin Python sequences, because Python sequences only support a limited
  form of "basic indexing" with integers and slices.

* The current plain indexing allows for the use of non-tuples for
  multi-dimensional indexing such as ``arr[[slice(None), 2]]``.
  This creates some inconsistencies and thus the indexing attributes
  should only allow plain python tuples for this purpose.
  (Whether or not this should be the case for plain indexing is a
  different issue.)
* The new attributes should not use getitem to implement setitem,
  since it is a kludge and not useful for vectorized
  indexing. (not implemented yet)


Open Questions
~~~~~~~~~~~~~~

* The names ``oindex``, ``vindex`` and ``legacy_index`` are just suggestions at
  the time of writing this; another name NumPy has used for something like
  ``oindex`` is ``np.ix_``. See also below.

* ``oindex`` and ``vindex`` could always return copies, even when no array
  operation occurs. One argument for allowing a view return is that this way
  ``oindex`` can be used as a general index replacement.
  However, there is one argument for returning copies. It is possible for
  ``arr.vindex[array_scalar, ...]``, where ``array_scalar`` should be
  a 0-D array but is not, since 0-D arrays tend to be converted.
  Copying always "fixes" this possible inconsistency.

* The final state into which plain indexing should morph is not fixed in this
  NEP. It is for example possible that ``arr[index]`` will be equivalent to
  ``arr.oindex`` at some point in the future.
  Since such a change will take years, it seems unnecessary to make
  specific decisions at this time.

* The proposed changes to plain indexing could be postponed indefinitely or
  not taken in order to not break or force major fixes to existing code bases.


Alternative Names
~~~~~~~~~~~~~~~~~

Possible names suggested (more suggestions will be added).

============== ============ ========
**Orthogonal** oindex       oix
**Vectorized** vindex       vix
**Legacy**     legacy_index l/findex
============== ============ ========


Subclasses
~~~~~~~~~~

Subclasses are a bit problematic in the light of these changes. There are
some possible solutions for this. For most subclasses (those which do not
provide ``__getitem__`` or ``__setitem__``) the special attributes should
just work. Subclasses that *do* provide it must be updated accordingly
and should preferably not subclass working versions of these attributes.

All subclasses will inherit the attributes, however, the implementation
of ``__getitem__`` on these attributes should test
``subclass.__getitem__ is ndarray.__getitem__``. If not, the
subclass has special handling for indexing and ``NotImplementedError``
should be raised, requiring that the indexing attributes are also explicitly
overridden. Likewise, implementations of ``__setitem__`` should check to see
if ``__setitem__`` is overridden.

A further question is how to facilitate implementing the special attributes.
Also there is the weird functionality where ``__setitem__`` calls
``__getitem__`` for non-advanced indices. It might be good to avoid it for
the new attributes, but on the other hand, that may make it even more
confusing.

To facilitate implementations we could provide functions similar to
``operator.itemgetter`` and ``operator.setitem`` for the attributes.
Possibly a mixin could be provided to help implementation. These improvements
are not essential to the initial implementation, so they are saved for
future work.

Implementation
--------------

Implementation would start with writing special indexing objects available
through ``arr.oindex``, ``arr.vindex``, and ``arr.legacy_index`` to allow these
indexing operations. Also, we would need to start to deprecate those plain index
operations which are ambiguous.
Furthermore, the NumPy code base will need to use the new attributes and
tests will have to be adapted.


Backward compatibility
----------------------

As a new feature, no backward compatibility issues with the new ``vindex``
and ``oindex`` attributes would arise.
To facilitate backwards compatibility
as much as possible, we expect a long deprecation cycle for legacy indexing
behavior and propose the new ``legacy_index`` attribute.
Some forward compatibility issues with subclasses that do not specifically
implement the new methods may arise.


Alternatives
------------

NumPy may not choose to offer these different types of indexing methods, or
choose to only offer them through specific functions instead of the proposed
notation above.

We don't think that new functions are a good alternative, because indexing
notation ``[]`` offers some syntactic advantages in Python (i.e., direct
creation of slice objects) compared to functions.

A more reasonable alternative would be to write new wrapper objects for
alternative indexing with functions rather than methods (e.g.,
``np.oindex(arr)[indices]`` instead of ``arr.oindex[indices]``). Functionally,
this would be equivalent, but indexing is such a common operation that we think
it is important to minimize syntax and worth implementing it directly on
`ndarray` objects themselves. Indexing attributes also define a clear interface
that is easier for alternative array implementations to copy, notwithstanding
ongoing efforts to make it easier to override NumPy functions [2]_.

Discussion
----------

The original discussion about vectorized vs outer/orthogonal indexing arose
on the NumPy mailing list:

* https://mail.python.org/pipermail/numpy-discussion/2015-April/072550.html

Some discussion can be found on the original pull request for this NEP:

* https://github.com/numpy/numpy/pull/6256

Python implementations of the indexing operations can be found at:

* https://github.com/numpy/numpy/pull/5749
* https://gist.github.com/shoyer/c700193625347eb68fee4d1f0dc8c0c8


Examples
~~~~~~~~

Since the various kinds of indexing are hard to grasp in many cases, these
examples hopefully give some more insight. Note that they are all in terms
of shape.
In the examples, all original dimensions have 5 or more elements;
advanced indexing inserts smaller dimensions.
These examples may be hard to grasp without working knowledge of advanced
indexing as of NumPy 1.9.

Example array::

    >>> arr = np.ones((5, 6, 7, 8))


Legacy fancy indexing
---------------------

Note that the same result can be achieved with ``arr.legacy_index``, but the
"future error" will still work in this case.

A single index is transposed (this is the same for all indexing types)::

    >>> arr[[0], ...].shape
    (1, 6, 7, 8)
    >>> arr[:, [0], ...].shape
    (5, 1, 7, 8)


Multiple indices are transposed *if* consecutive::

    >>> arr[:, [0], [0], :].shape  # future error
    (5, 1, 8)
    >>> arr[:, [0], :, [0]].shape  # future error
    (1, 5, 7)


It is important to note that a scalar *is* an integer array index in this sense
(and gets broadcast with the other advanced index)::

    >>> arr[:, [0], 0, :].shape
    (5, 1, 8)
    >>> arr[:, [0], :, 0].shape  # future error (scalar is "fancy")
    (1, 5, 7)


A single boolean index can act on multiple dimensions (especially the whole
array). It has to match the dimensions (as of 1.10 this gives a deprecation
warning).
The boolean index is otherwise identical to (multiple consecutive) integer
array indices::

    >>> # Create a boolean index with one True value for the last two dimensions:
    >>> bindx = np.zeros((7, 8), dtype=np.bool_)
    >>> bindx[0, 0] = True
    >>> arr[:, 0, bindx].shape
    (5, 1)
    >>> arr[0, :, bindx].shape
    (1, 6)


The combination with anything that is not a scalar is confusing, e.g.::

    >>> arr[[0], :, bindx].shape  # bindx result broadcasts with [0]
    (1, 6)
    >>> arr[:, [0, 1], bindx].shape  # IndexError


Outer indexing
--------------

Multiple indices are "orthogonal" and their result axes are inserted
at the same place (they are not broadcast)::

    >>> arr.oindex[:, [0], [0, 1], :].shape
    (5, 1, 2, 8)
    >>> arr.oindex[:, [0], :, [0, 1]].shape
    (5, 1, 7, 2)
    >>> arr.oindex[:, [0], 0, :].shape
    (5, 1, 8)
    >>> arr.oindex[:, [0], :, 0].shape
    (5, 1, 7)


Boolean index results are always inserted where the index is::

    >>> # Create a boolean index with one True value for the last two dimensions:
    >>> bindx = np.zeros((7, 8), dtype=np.bool_)
    >>> bindx[0, 0] = True
    >>> arr.oindex[:, 0, bindx].shape
    (5, 1)
    >>> arr.oindex[0, :, bindx].shape
    (6, 1)


Nothing changes in the presence of other advanced indices::

    >>> arr.oindex[[0], :, bindx].shape
    (1, 6, 1)
    >>> arr.oindex[:, [0, 1], bindx].shape
    (5, 2, 1)


Vectorized/inner indexing
-------------------------

Multiple indices are broadcast and iterated as one like fancy indexing,
but the new axes are always inserted at the front::

    >>> arr.vindex[:, [0], [0, 1], :].shape
    (2, 5, 8)
    >>> arr.vindex[:, [0], :, [0, 1]].shape
    (2, 5, 7)
    >>> arr.vindex[:, [0], 0, :].shape
    (1, 5, 8)
    >>> arr.vindex[:, [0], :, 0].shape
    (1, 5, 7)


Boolean index results are always inserted where the index is, exactly
as in ``oindex``, given how specific they are to the axes they operate on::

    >>> # Create a boolean index with one True value for the last two dimensions:
    >>> bindx = np.zeros((7, 8), dtype=np.bool_)
    >>> bindx[0, 0] = True
    >>> arr.vindex[:, 0, bindx].shape
    (5, 1)
    >>> arr.vindex[0, :, bindx].shape
    (6, 1)


But other advanced indices are again transposed to the front::

    >>> arr.vindex[[0], :, bindx].shape
    (1, 6, 1)
    >>> arr.vindex[:, [0, 1], bindx].shape
    (2, 5, 1)


Motivational Example
~~~~~~~~~~~~~~~~~~~~

Imagine data acquisition software storing ``D`` channels and ``N`` datapoints
along time into an ``(N, D)`` shaped array. During data analysis, we need to
fetch a pool of channels, for example to calculate a mean over them.

This data can be faked using::

    >>> arr = np.random.random((100, 10))

Now one may remember indexing with an integer array and find the correct code::

    >>> group = arr[:, [2, 5]]
    >>> mean_value = group.mean()

However, assume that there were some specific time points (first dimension
of the data) that need to be specially considered. These time points are
already known and given by::

    >>> interesting_times = np.array([1, 5, 8, 10], dtype=np.intp)

Now to fetch them, we may try to modify the previous code::

    >>> group_at_it = arr[interesting_times, [2, 5]]
    IndexError: Ambiguous index, use `.oindex` or `.vindex`

An error such as this will point the user to the indexing documentation, which
should make it clear that ``oindex`` behaves more like slicing.
So, out of the different methods it is the obvious choice
(for now, this is a shape mismatch, but that could possibly also mention
``oindex``)::

    >>> group_at_it = arr.oindex[interesting_times, [2, 5]]

Now of course one could also have used ``vindex``, but it is much less
obvious how to achieve the right thing!::

    >>> reshaped_times = interesting_times[:, np.newaxis]
    >>> group_at_it = arr.vindex[reshaped_times, [2, 5]]


One may find that, for example, our data is corrupt in some places.
So, we need to replace these values by zero (or anything else) for these
times. The first column may for example give the necessary information,
so that changing the values becomes easy once one remembers boolean indexing::

    >>> bad_data = arr[:, 0] > 0.5
    >>> arr[bad_data, :] = 0  # (corrupts further examples)

Again, however, the columns may need to be handled more individually (but in
groups), and the ``oindex`` attribute works well::

    >>> arr.oindex[bad_data, [2, 5]] = 0

Note that it would be very hard to do this using legacy fancy indexing.
The only way would be to create an integer array first::

    >>> bad_data_indx = np.nonzero(bad_data)[0]
    >>> bad_data_indx_reshaped = bad_data_indx[:, np.newaxis]
    >>> arr[bad_data_indx_reshaped, [2, 5]] = 0

In any case we can use only ``oindex`` to do all of this without getting
into any trouble or getting confused by the whole complexity of advanced
indexing.

But, some new features are added to the data acquisition. Different sensors
have to be used depending on the times. Let us assume we have already
created an array of indices::

    >>> correct_sensors = np.random.randint(10, size=(100, 2))

which lists, for each time, the two correct sensors in an ``(N, 2)`` array.

A first try to achieve this may be ``arr[:, correct_sensors]``, but this does
not work. It should be clear quickly that slicing cannot achieve the desired
thing. But hopefully users will remember that there is ``vindex`` as a more
powerful and flexible approach to advanced indexing.
One may, if trying ``vindex`` at random, be confused by::

    >>> new_arr = arr.vindex[:, correct_sensors]

which is neither the same, nor the correct result (see transposing rules)!
This is because slicing still works the same in ``vindex``. However, reading
the documentation and examples, one can hopefully quickly find the desired
solution::

    >>> rows = np.arange(len(arr))
    >>> rows = rows[:, np.newaxis]  # make shape fit with correct_sensors
    >>> new_arr = arr.vindex[rows, correct_sensors]

At this point we have left the straightforward world of ``oindex`` but can
do random picking of any element from the array. Note that in the last example
a method such as mentioned in the ``Related Questions`` section could be more
straightforward. But this approach is even more flexible, since ``rows``
does not have to be a simple ``arange``, but could be ``interesting_times``::

    >>> interesting_times = np.array([0, 4, 8, 9, 10])
    >>> correct_sensors_at_it = correct_sensors[interesting_times, :]
    >>> interesting_times_reshaped = interesting_times[:, np.newaxis]
    >>> new_arr_it = arr[interesting_times_reshaped, correct_sensors_at_it]

A truly complex situation would arise if, for example, you pooled ``L``
experiments into an array shaped ``(L, N, D)``. But for ``oindex`` this should
not result in surprises. ``vindex``, being more powerful, will quite
certainly create some confusion in this case but also cover pretty much all
eventualities.


Copyright
---------

This document is placed under the CC0 1.0 Universal (CC0 1.0) Public Domain
Dedication [1]_.
References and Footnotes
------------------------

.. [1] To the extent possible under law, the person who associated CC0
   with this work has waived all copyright and related or neighboring
   rights to this work. The CC0 license may be found at
   https://creativecommons.org/publicdomain/zero/1.0/
.. [2] e.g., see NEP 18,
   http://www.numpy.org/neps/nep-0018-array-function-protocol.html

-------------- next part -------------- An HTML attachment was scrubbed... URL: From wieser.eric+numpy at gmail.com Mon Jun 25 23:06:42 2018 From: wieser.eric+numpy at gmail.com (Eric Wieser) Date: Mon, 25 Jun 2018 20:06:42 -0700 Subject: [Numpy-discussion] NEP 21: Simplified and explicit advanced indexing In-Reply-To: References: Message-ID:

Generally +1 on this, but I don't think we need

    To ensure that existing subclasses of ndarray that override indexing
    do not inadvertently revert to default behavior for indexing attributes,
    these attributes should have explicit checks that disable them if
    __getitem__ or __setitem__ has been overridden.

Repeating my proposal from github, I think we should introduce some internal
indexing objects - something simple like::

    # np.core.*
    class Indexer(object):  # importantly not iterable
        def __init__(self, value):
            self.value = value

    class OrthogonalIndexer(Indexer):
        pass

    class VectorizedIndexer(Indexer):
        pass

Keeping the proposed syntax, we'd implement:

- ``arr.oindex[ind]`` as ``arr[np.core.OrthogonalIndexer(ind)]``
- ``arr.vindex[ind]`` as ``arr[np.core.VectorizedIndexer(ind)]``

This means that subclasses like the following::

    class LoggingIndexer(np.ndarray):
        def __getitem__(self, ind):
            ret = super().__getitem__(ind)
            print("Got an index")
            return ret

will continue to work without issues. This includes np.ma.MaskedArray and
np.memmap, so this already has value internally.

For classes like np.matrix which inspect the index object itself, an error
will still be raised from __getitem__, since it looks nothing like the values
normally passed - most likely of the form::

    TypeError: 'numpy.core.VectorizedIndexer' object does not support indexing
    TypeError: 'numpy.core.VectorizedIndexer' object is not iterable

This could potentially be caught in oindex.__getitem__ and converted into a
more useful error message.

So to summarize the benefits of the above tweaks:

- Pass-through subclasses get the new behavior for free
- No additional descriptor helpers are needed to let non-passthrough
  subclasses implement the new indexable attributes - only a change to
  __getitem__ is needed

And the costs:

- A less clear error message when new indexing is used on old types
  (can chain with a more useful exception on python 3)
- Class construction overhead for indexing via the attributes
  (skippable for base ndarray if significant)

Eric

On Mon, 25 Jun 2018 at 14:30 Stephan Hoyer wrote:

> Sebastian and I have revised a Numpy Enhancement Proposal that he started
> three years ago for overhauling NumPy's advanced indexing. We'd now like to
> present it for official consideration.
>
> Minor inline comments (e.g., typos) can be added to the latest pull
> request (https://github.com/numpy/numpy/pull/11414/files), but otherwise
> let's keep discussion on the mailing list. The NumPy website should update
> shortly with a rendered version (
> http://www.numpy.org/neps/nep-0021-advanced-indexing.html), but until
> then please see the full text below.
> [full NEP text snipped]
-------------- next part -------------- An HTML attachment was scrubbed... URL: From jni.soma at gmail.com Tue Jun 26 02:24:13 2018 From: jni.soma at gmail.com (Juan Nunez-Iglesias) Date: Tue, 26 Jun 2018 16:24:13 +1000 Subject: [Numpy-discussion] NEP 21: Simplified and explicit advanced indexing In-Reply-To: References: Message-ID: <1529994253.102107.1420482720.1911BC65@webmail.messagingengine.com>

> Plain indexing arr[...] should return an error for ambiguous cases.
> [...] This includes every use of vectorized indexing with multiple
> integer arrays.

This line concerns me. In scikit-image, we often do:

    rr, cc = coords.T  # coords is an (n, 2) array of integer coordinates
    values = image[rr, cc]

Are you saying that this use is deprecated? Because we love it at
scikit-image. I would be very very very sad to lose this syntax.

> The current plain indexing allows for the use of non-tuples for
> multi-dimensional indexing.

I believe this paragraph is itself deprecated? Didn't non-tuple indexing just
get deprecated with 1.15?

Other general comments:

- oindex in general seems very intuitive and I'm :+1:
- I would much prefer some extremely compact notation such as arr.ox[] and
  arr.vx.
- Depending on the above concern I am either -1 or (-1/0) on the deprecation.
  Deprecating (all) old vindex behaviour doesn't seem to bring many benefits
  while potentially causing a lot of pain to downstream libraries.

Juan.

-------------- next part -------------- An HTML attachment was scrubbed... URL: From andyfaff at gmail.com Tue Jun 26 02:28:22 2018 From: andyfaff at gmail.com (Andrew Nelson) Date: Tue, 26 Jun 2018 16:28:22 +1000 Subject: [Numpy-discussion] NEP 21: Simplified and explicit advanced indexing In-Reply-To: <1529994253.102107.1420482720.1911BC65@webmail.messagingengine.com> References: <1529994253.102107.1420482720.1911BC65@webmail.messagingengine.com> Message-ID:

On Tue, 26 Jun 2018 at 16:24, Juan Nunez-Iglesias wrote:

> > Plain indexing arr[...] should return an error for ambiguous cases.
> [...] This includes every use of vectorized indexing with multiple integer
> arrays.
>
> This line concerns me. In scikit-image, we often do:
>
> rr, cc = coords.T  # coords is an (n, 2) array of integer coordinates
> values = image[rr, cc]
>
> Are you saying that this use is deprecated? Because we love it at
> scikit-image. I would be very very very sad to lose this syntax.

I second Juan's sentiments wholeheartedly here.

-------------- next part -------------- An HTML attachment was scrubbed...
URL: From robert.kern at gmail.com Tue Jun 26 02:45:34 2018 From: robert.kern at gmail.com (Robert Kern) Date: Mon, 25 Jun 2018 23:45:34 -0700 Subject: [Numpy-discussion] NEP 21: Simplified and explicit advanced indexing In-Reply-To: References: <1529994253.102107.1420482720.1911BC65@webmail.messagingengine.com> Message-ID: On Mon, Jun 25, 2018 at 11:29 PM Andrew Nelson wrote: > On Tue, 26 Jun 2018 at 16:24, Juan Nunez-Iglesias > wrote: > >> > Plain indexing arr[...] should return an error for ambiguous cases. >> [...] This includes every use of vectorized indexing with multiple integer >> arrays. >> >> This line concerns me. In scikit-image, we often do: >> >> rr, cc = coords.T # coords is an (n, 2) array of integer coordinates >> values = image[rr, cc] >> >> Are you saying that this use is deprecated? Because we love it at >> scikit-image. I would be very very very sad to lose this syntax. >> > > I second Juan's sentiments wholeheartedly here. > And thirded. This should not be considered deprecated or discouraged. As I mentioned in the previous iteration of this discussion, this is the behavior I want more often than the orthogonal indexing. It's a really common way to work with images and other kinds of raster data, so I don't think it should be relegated to the "officially discouraged" ghetto of `.legacy_index`. It should not issue warnings or (eventual) errors. I would reserve warnings for the cases where the current behavior is something no one really wants, like mixing slices and integer arrays. -- Robert Kern -------------- next part -------------- An HTML attachment was scrubbed... URL: From wieser.eric+numpy at gmail.com Tue Jun 26 03:11:57 2018 From: wieser.eric+numpy at gmail.com (Eric Wieser) Date: Tue, 26 Jun 2018 00:11:57 -0700 Subject: [Numpy-discussion] NEP 21: Simplified and explicit advanced indexing In-Reply-To: References: <1529994253.102107.1420482720.1911BC65@webmail.messagingengine.com> Message-ID: > I don't think it should be relegated to the "officially discouraged" ghetto of `.legacy_index` The way I read it, the new spelling lof that would be the explicit but not discouraged `image.vindex[rr, cc]`. > I would reserve warnings for the cases where the current behavior is something no one really wants, like mixing slices and integer arrays. These are the cases that would only be available under `legacy_index`. Eric On Mon, 25 Jun 2018 at 23:54 Robert Kern wrote: > On Mon, Jun 25, 2018 at 11:29 PM Andrew Nelson wrote: > >> On Tue, 26 Jun 2018 at 16:24, Juan Nunez-Iglesias >> wrote: >> >>> > Plain indexing arr[...] should return an error for ambiguous cases. >>> [...] This includes every use of vectorized indexing with multiple integer >>> arrays. >>> >>> This line concerns me. In scikit-image, we often do: >>> >>> rr, cc = coords.T # coords is an (n, 2) array of integer coordinates >>> values = image[rr, cc] >>> >>> Are you saying that this use is deprecated? Because we love it at >>> scikit-image. I would be very very very sad to lose this syntax. >>> >> >> I second Juan's sentiments wholeheartedly here. >> > > And thirded. This should not be considered deprecated or discouraged. As I > mentioned in the previous iteration of this discussion, this is the > behavior I want more often than the orthogonal indexing. It's a really > common way to work with images and other kinds of raster data, so I don't > think it should be relegated to the "officially discouraged" ghetto of > `.legacy_index`. It should not issue warnings or (eventual) errors. 
I would > reserve warnings for the cases where the current behavior is something no > one really wants, like mixing slices and integer arrays. > > > -- > Robert Kern > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at python.org > https://mail.python.org/mailman/listinfo/numpy-discussion > -------------- next part -------------- An HTML attachment was scrubbed... URL: From andyfaff at gmail.com Tue Jun 26 03:30:01 2018 From: andyfaff at gmail.com (Andrew Nelson) Date: Tue, 26 Jun 2018 17:30:01 +1000 Subject: [Numpy-discussion] NEP 21: Simplified and explicit advanced indexing In-Reply-To: References: <1529994253.102107.1420482720.1911BC65@webmail.messagingengine.com> Message-ID: On Tue, 26 Jun 2018 at 17:12, Eric Wieser wrote: > > I don't think it should be relegated to the "officially discouraged" > ghetto of `.legacy_index` > > The way I read it, the new spelling lof that would be the explicit but not > discouraged `image.vindex[rr, cc]`. > If I'm understanding correctly what can be achieved now by `arr[rr, cc]` would have to be modified to use `arr.vindex[rr, cc]`, which is a very large change in behaviour. I suspect that there a lot of situations out there which use `arr[idxs]` where `idxs` can mean one of a range of things depending on the code path followed. If any of those change, or a mix of nomenclatures are required to access the different cases, then havoc will probably ensue. -------------- next part -------------- An HTML attachment was scrubbed... URL: From robert.kern at gmail.com Tue Jun 26 03:46:02 2018 From: robert.kern at gmail.com (Robert Kern) Date: Tue, 26 Jun 2018 00:46:02 -0700 Subject: [Numpy-discussion] NEP 21: Simplified and explicit advanced indexing In-Reply-To: References: <1529994253.102107.1420482720.1911BC65@webmail.messagingengine.com> Message-ID: On Tue, Jun 26, 2018 at 12:13 AM Eric Wieser wrote: > > I don't think it should be relegated to the "officially discouraged" > ghetto of `.legacy_index` > > The way I read it, the new spelling lof that would be the explicit but not > discouraged `image.vindex[rr, cc]`. > Okay, I missed that the first time through. I think having more self-contained descriptions of the semantics of each of these would be a good idea. The current description of `.vindex` spends more time talking about what it doesn't do, compared to the other methods, than what it does. Some more typical, less-exotic examples would be a good idea. > I would reserve warnings for the cases where the current behavior is > something no one really wants, like mixing slices and integer arrays. > > These are the cases that would only be available under `legacy_index`. > I'm still leaning towards not warning on current, unproblematic common uses. It's unnecessary churn for currently working, understandable code. I would still reserve warnings and deprecation for the cases where the current behavior gives us something that no one wants. Those are the real traps that people need to be warned away from. If someone is mixing slices and integer indices, that's a really good sign that they thought indexing behaved in a different way (e.g. orthogonal indexing). If someone is just using multiple index arrays that would currently not give an error, that's actually a really good sign that they are using it correctly and are getting the semantics that they desired. If they wanted orthogonal indexing, it is *really* likely that their index arrays would *not* broadcast together. 
And even if they did, the wrong shape of the result is one of the more easily noticed things. These are not silent errors that would motivate adding a new warning. -- Robert Kern -------------- next part -------------- An HTML attachment was scrubbed... URL: From robert.kern at gmail.com Tue Jun 26 03:54:43 2018 From: robert.kern at gmail.com (Robert Kern) Date: Tue, 26 Jun 2018 00:54:43 -0700 Subject: [Numpy-discussion] NEP 21: Simplified and explicit advanced indexing In-Reply-To: References: <1529994253.102107.1420482720.1911BC65@webmail.messagingengine.com> Message-ID: On Tue, Jun 26, 2018 at 12:46 AM Robert Kern wrote: > On Tue, Jun 26, 2018 at 12:13 AM Eric Wieser > wrote: > >> > I would reserve warnings for the cases where the current behavior is >> something no one really wants, like mixing slices and integer arrays. >> >> These are the cases that would only be available under `legacy_index`. >> > > I'm still leaning towards not warning on current, unproblematic common > uses. It's unnecessary churn for currently working, understandable code. I > would still reserve warnings and deprecation for the cases where the > current behavior gives us something that no one wants. Those are the real > traps that people need to be warned away from. > > If someone is mixing slices and integer indices, that's a really good sign > that they thought indexing behaved in a different way (e.g. orthogonal > indexing). > > If someone is just using multiple index arrays that would currently not > give an error, that's actually a really good sign that they are using it > correctly and are getting the semantics that they desired. If they wanted > orthogonal indexing, it is *really* likely that their index arrays would > *not* broadcast together. And even if they did, the wrong shape of the > result is one of the more easily noticed things. These are not silent > errors that would motivate adding a new warning. > Of course, I would definitely support adding more information to the various IndexError messages to point people to `.oindex` and `.vindex`. I think that would guide more people to correct their code than adding a new warning to code that currently executes (which is likely not erroneous), and it would cause no churn. -- Robert Kern -------------- next part -------------- An HTML attachment was scrubbed... URL: From sebastian at sipsolutions.net Tue Jun 26 03:57:07 2018 From: sebastian at sipsolutions.net (Sebastian Berg) Date: Tue, 26 Jun 2018 09:57:07 +0200 Subject: [Numpy-discussion] NEP 21: Simplified and explicit advanced indexing In-Reply-To: References: <1529994253.102107.1420482720.1911BC65@webmail.messagingengine.com> Message-ID: On Tue, 2018-06-26 at 17:30 +1000, Andrew Nelson wrote: > On Tue, 26 Jun 2018 at 17:12, Eric Wieser m> wrote: > > > I don't think it should be relegated to the "officially > > discouraged" ghetto of `.legacy_index` > > > > The way I read it, the new spelling lof that would be the explicit > > but not discouraged `image.vindex[rr, cc]`. > > > > If I'm understanding correctly what can be achieved now by `arr[rr, > cc]` would have to be modified to use `arr.vindex[rr, cc]`, which is > a very large change in behaviour. I suspect that there a lot of > situations out there which use `arr[idxs]` where `idxs` can mean one > of a range of things depending on the code path followed. If any of > those change, or a mix of nomenclatures are required to access the > different cases, then havoc will probably ensue. 
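To make the two behaviours under discussion concrete, here is a minimal sketch using only current NumPy (the orthogonal case is emulated with np.ix_, since `.oindex` and `.vindex` are still only proposals):

import numpy as np

image = np.arange(12).reshape(3, 4)
rr = np.array([0, 1, 2])
cc = np.array([1, 3, 0])

# Today's plain indexing, which the NEP would also spell image.vindex[rr, cc]:
# rr and cc are broadcast against each other and pick out the individual
# points (0, 1), (1, 3) and (2, 0), giving shape (3,).
points = image[rr, cc]

# Orthogonal/outer indexing, which the NEP would spell image.oindex[rr, cc]:
# every row in rr is combined with every column in cc, giving shape (3, 3).
# In current NumPy this has to be written via np.ix_:
block = image[np.ix_(rr, cc)]
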
Yes, that is true, but I doubt you will find a lot of code path that need the current indexing as opposed to vindex here, and the idea was to have a method to get the old behaviour indefinitely. You will need to add the `.vindex`, but that should be the only code change needed, and it would be easy to find where with errors/warnings. I see a possible problem with code that has to work on different numpy versions, but only in meaning we need to delay deprecations. The only thing I could imagine where this might happen is if you forward someone elses indexing objects and different users are used to different results. Otherwise, there is mostly one case which would get annoying, and that is `arr[:, rr, cc]` since `arr.vindex[:, rr, cc]` would not be exactly the same. Because, yes, in some cases the current logic is convenient, just incredibly surprising as well. - Sebastian > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at python.org > https://mail.python.org/mailman/listinfo/numpy-discussion -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 833 bytes Desc: This is a digitally signed message part URL: From einstein.edison at gmail.com Tue Jun 26 04:01:24 2018 From: einstein.edison at gmail.com (Hameer Abbasi) Date: Tue, 26 Jun 2018 04:01:24 -0400 Subject: [Numpy-discussion] NEP 21: Simplified and explicit advanced indexing In-Reply-To: References: Message-ID: I second this design. If we were to consider the general case of a tuple `idx`, then we?d not be moving forward at all. Design changes would be impossible. I?d argue that this newer model would be easier for library maintainers overall (who are the kind of people using this), reducing maintenance cost in the long run because it?d lead to simpler code. I would also that the ?internal? classes expressing outer as vectorised indexing etc. should be exposed, for maintainers of duck arrays to use. God knows how many utility functions I?ve had to write to avoid relying on undocumented NumPy internals for pydata/sparse, fearing that I?d have to rewrite/modify them when behaviour changes or I find other corner cases. Best Regards, Hameer Abbasi Sent from Astro for Mac On 26. Jun 2018 at 09:46, Robert Kern wrote: On Tue, Jun 26, 2018 at 12:13 AM Eric Wieser wrote: > > I don't think it should be relegated to the "officially discouraged" > ghetto of `.legacy_index` > > The way I read it, the new spelling lof that would be the explicit but not > discouraged `image.vindex[rr, cc]`. > Okay, I missed that the first time through. I think having more self-contained descriptions of the semantics of each of these would be a good idea. The current description of `.vindex` spends more time talking about what it doesn't do, compared to the other methods, than what it does. Some more typical, less-exotic examples would be a good idea. > I would reserve warnings for the cases where the current behavior is > something no one really wants, like mixing slices and integer arrays. > > These are the cases that would only be available under `legacy_index`. > I'm still leaning towards not warning on current, unproblematic common uses. It's unnecessary churn for currently working, understandable code. I would still reserve warnings and deprecation for the cases where the current behavior gives us something that no one wants. Those are the real traps that people need to be warned away from. 
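One classic trap of this kind, sketched for concreteness with current NumPy (the values are arbitrary, only the shapes matter):

import numpy as np

x = np.zeros((2, 3, 4))

# An integer and an integer array separated by a slice: the broadcast
# dimension of the advanced indices is moved to the front of the result,
# so this comes out with shape (2, 3), not the (3, 2) most readers expect.
x[0, :, [0, 1]].shape
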
If someone is mixing slices and integer indices, that's a really good sign that they thought indexing behaved in a different way (e.g. orthogonal indexing). If someone is just using multiple index arrays that would currently not give an error, that's actually a really good sign that they are using it correctly and are getting the semantics that they desired. If they wanted orthogonal indexing, it is *really* likely that their index arrays would *not* broadcast together. And even if they did, the wrong shape of the result is one of the more easily noticed things. These are not silent errors that would motivate adding a new warning. -- Robert Kern _______________________________________________ NumPy-Discussion mailing list NumPy-Discussion at python.org https://mail.python.org/mailman/listinfo/numpy-discussion -------------- next part -------------- An HTML attachment was scrubbed... URL: From robert.kern at gmail.com Tue Jun 26 04:21:00 2018 From: robert.kern at gmail.com (Robert Kern) Date: Tue, 26 Jun 2018 01:21:00 -0700 Subject: [Numpy-discussion] NEP 21: Simplified and explicit advanced indexing In-Reply-To: References: <1529994253.102107.1420482720.1911BC65@webmail.messagingengine.com> Message-ID: On Tue, Jun 26, 2018 at 12:58 AM Sebastian Berg wrote: > On Tue, 2018-06-26 at 17:30 +1000, Andrew Nelson wrote: > > On Tue, 26 Jun 2018 at 17:12, Eric Wieser > m> wrote: > > > > I don't think it should be relegated to the "officially > > > discouraged" ghetto of `.legacy_index` > > > > > > The way I read it, the new spelling lof that would be the explicit > > > but not discouraged `image.vindex[rr, cc]`. > > > > > > > If I'm understanding correctly what can be achieved now by `arr[rr, > > cc]` would have to be modified to use `arr.vindex[rr, cc]`, which is > > a very large change in behaviour. I suspect that there a lot of > > situations out there which use `arr[idxs]` where `idxs` can mean one > > of a range of things depending on the code path followed. If any of > > those change, or a mix of nomenclatures are required to access the > > different cases, then havoc will probably ensue. > > Yes, that is true, but I doubt you will find a lot of code path that > need the current indexing as opposed to vindex here, That's probably true! But I think it's besides the point. I'd wager that most code paths that will use .vindex would work perfectly well with current indexing, too. Most of the time, people aren't getting into the hairy corners of advanced indexing. Adding to the toolbox is great, but I don't see a good reason to take out the ones that are commonly used quite safely. > and the idea was > to have a method to get the old behaviour indefinitely. You will need > to add the `.vindex`, but that should be the only code change needed, > and it would be easy to find where with errors/warnings. > It's not necessarily hard; it's just churn for no benefit to the downstream code. They didn't get a new feature; they just have to run faster to stay in the same place. -- Robert Kern -------------- next part -------------- An HTML attachment was scrubbed... URL: From einstein.edison at gmail.com Tue Jun 26 04:23:23 2018 From: einstein.edison at gmail.com (Hameer Abbasi) Date: Tue, 26 Jun 2018 04:23:23 -0400 Subject: [Numpy-discussion] NEP 21: Simplified and explicit advanced indexing In-Reply-To: References: Message-ID: > Boolean indices are not supported. All indices must be integers, integer arrays or slices. I would hope that there?s at least some way to do boolean indexing. 
I often find myself needing it. I realise that `arr.vindex[np.nonzero(boolean_idx)]` works, but it is slightly too verbose for my liking. Maybe we can have `arr.bindex[boolean_index]` as an alias to exactly that? Or is boolean indexing preserved as-is n the newest proposal? If so, great! Another thing I?d say is `arr.?index` should be replaced with `arr.?idx`. I personally prefer `arr.?x` for my fingers but I realise that for someone not super into NumPy indexing, this is kind of opaque to read, so I propose this less verbose but hopefully equally clear version, for my (and others?) brains. Best Regards, Hameer Abbasi Sent from Astro for Mac -------------- next part -------------- An HTML attachment was scrubbed... URL: From teoliphant at gmail.com Tue Jun 26 04:24:06 2018 From: teoliphant at gmail.com (Travis Oliphant) Date: Tue, 26 Jun 2018 02:24:06 -0600 Subject: [Numpy-discussion] NEP 21: Simplified and explicit advanced indexing In-Reply-To: References: Message-ID: I like the proposal generally. NumPy could use a good orthogonal indexing method and a vectorized-indexing method is fine too. Robert Kern is spot on with his concerns as well. Please do not change what arr[idx] does except to provide warnings and perhaps point people to new .oix and .vix methods. What indexing does is documented (if hard to understand and surprising in a particular sub-case). There is one specific place in the code where I would make a change to raise an error rather than change the order of the axes of the output to provide a consistent subspace. Even then, it should be done as a deprecation warning and then raise the error. Otherwise, just add the new methods and don't make any other changes until a major release. -Travis On Tue, Jun 26, 2018 at 2:03 AM Hameer Abbasi wrote: > I second this design. If we were to consider the general case of a tuple > `idx`, then we?d not be moving forward at all. Design changes would be > impossible. I?d argue that this newer model would be easier for library > maintainers overall (who are the kind of people using this), reducing > maintenance cost in the long run because it?d lead to simpler code. > > I would also that the ?internal? classes expressing outer as vectorised > indexing etc. should be exposed, for maintainers of duck arrays to use. God > knows how many utility functions I?ve had to write to avoid relying on > undocumented NumPy internals for pydata/sparse, fearing that I?d have to > rewrite/modify them when behaviour changes or I find other corner cases. > > Best Regards, > Hameer Abbasi > Sent from Astro for Mac > > On 26. Jun 2018 at 09:46, Robert Kern wrote: > > > On Tue, Jun 26, 2018 at 12:13 AM Eric Wieser > wrote: > >> > I don't think it should be relegated to the "officially discouraged" >> ghetto of `.legacy_index` >> >> The way I read it, the new spelling lof that would be the explicit but >> not discouraged `image.vindex[rr, cc]`. >> > > Okay, I missed that the first time through. I think having more > self-contained descriptions of the semantics of each of these would be a > good idea. The current description of `.vindex` spends more time talking > about what it doesn't do, compared to the other methods, than what it does. > > Some more typical, less-exotic examples would be a good idea. > > > I would reserve warnings for the cases where the current behavior is >> something no one really wants, like mixing slices and integer arrays. >> >> These are the cases that would only be available under `legacy_index`. 
>> > > I'm still leaning towards not warning on current, unproblematic common > uses. It's unnecessary churn for currently working, understandable code. I > would still reserve warnings and deprecation for the cases where the > current behavior gives us something that no one wants. Those are the real > traps that people need to be warned away from. > > If someone is mixing slices and integer indices, that's a really good sign > that they thought indexing behaved in a different way (e.g. orthogonal > indexing). > > If someone is just using multiple index arrays that would currently not > give an error, that's actually a really good sign that they are using it > correctly and are getting the semantics that they desired. If they wanted > orthogonal indexing, it is *really* likely that their index arrays would > *not* broadcast together. And even if they did, the wrong shape of the > result is one of the more easily noticed things. These are not silent > errors that would motivate adding a new warning. > > -- > Robert Kern > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at python.org > https://mail.python.org/mailman/listinfo/numpy-discussion > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at python.org > https://mail.python.org/mailman/listinfo/numpy-discussion > -------------- next part -------------- An HTML attachment was scrubbed... URL: From wieser.eric+numpy at gmail.com Tue Jun 26 04:28:09 2018 From: wieser.eric+numpy at gmail.com (Eric Wieser) Date: Tue, 26 Jun 2018 01:28:09 -0700 Subject: [Numpy-discussion] NEP 21: Simplified and explicit advanced indexing In-Reply-To: References: Message-ID: Another thing I?d say is arr.?index should be replaced with arr.?idx. Or perhaps arr.o_[] and arr.v_[], to match the style of our existing np.r_, np.c_, np.s_, etc? From robert.kern at gmail.com Tue Jun 26 04:33:15 2018 From: robert.kern at gmail.com (Robert Kern) Date: Tue, 26 Jun 2018 01:33:15 -0700 Subject: [Numpy-discussion] NEP 21: Simplified and explicit advanced indexing In-Reply-To: References: Message-ID: On Tue, Jun 26, 2018 at 1:26 AM Travis Oliphant wrote: > I like the proposal generally. NumPy could use a good orthogonal indexing > method and a vectorized-indexing method is fine too. > > Robert Kern is spot on with his concerns as well. Please do not change > what arr[idx] does except to provide warnings and perhaps point people to > new .oix and .vix methods. What indexing does is documented (if hard to > understand and surprising in a particular sub-case). > > There is one specific place in the code where I would make a change to > raise an error rather than change the order of the axes of the output to > provide a consistent subspace. Even then, it should be done as a > deprecation warning and then raise the error. > > Otherwise, just add the new methods and don't make any other changes until > a major release. > I'd suggest that the NEP explicitly disclaim deprecating current behavior. Let the NEP just be about putting the new features out there. Once we have some experience with them for a year or three, then let's talk about deprecating parts of the current behavior and make a new NEP then if we want to go that route. We're only contemplating *long* deprecation cycles anyways; we're not in a race. The success of these new features doesn't really rely on the deprecation of current indexing, so let's separate those issues. 
-- Robert Kern -------------- next part -------------- An HTML attachment was scrubbed... URL: From einstein.edison at gmail.com Tue Jun 26 04:34:37 2018 From: einstein.edison at gmail.com (Hameer Abbasi) Date: Tue, 26 Jun 2018 04:34:37 -0400 Subject: [Numpy-discussion] NEP 21: Simplified and explicit advanced indexing In-Reply-To: References: Message-ID: I actually had to think a lot, read docs, use SO and so on to realise what those meant the first time around, I didn?t understand them on sight. And I had to keep coming back to the docs from time to time as I wasn?t exactly using them too much (for exactly this reason, when some problems could be solved more simply by doing just that). I?d prefer something that sticks in your head and ?underscore? for ?indexing? didn't do that for me. Of course, this was my experience as a first-timer. I?d prefer not to up the learning curve for others in the same situation. An experienced user might disagree. :-) Best Regards, Hameer Abbasi Sent from Astro for Mac On 26. Jun 2018 at 10:28, Eric Wieser wrote: Another thing I?d say is arr.?index should be replaced with arr.?idx. Or perhaps arr.o_[] and arr.v_[], to match the style of our existing np.r_, np.c_, np.s_, etc? _______________________________________________ NumPy-Discussion mailing list NumPy-Discussion at python.org https://mail.python.org/mailman/listinfo/numpy-discussion -------------- next part -------------- An HTML attachment was scrubbed... URL: From sebastian at sipsolutions.net Tue Jun 26 04:35:20 2018 From: sebastian at sipsolutions.net (Sebastian Berg) Date: Tue, 26 Jun 2018 10:35:20 +0200 Subject: [Numpy-discussion] NEP 21: Simplified and explicit advanced indexing In-Reply-To: References: <1529994253.102107.1420482720.1911BC65@webmail.messagingengine.com> Message-ID: <3aa8387021a6367c7a6227e424226904631bce60.camel@sipsolutions.net> On Tue, 2018-06-26 at 01:21 -0700, Robert Kern wrote: > On Tue, Jun 26, 2018 at 12:58 AM Sebastian Berg > wrote: > > > > Yes, that is true, but I doubt you will find a lot of code path > > that > > need the current indexing as opposed to vindex here, > > That's probably true! But I think it's besides the point. I'd wager > that most code paths that will use .vindex would work perfectly well > with current indexing, too. Most of the time, people aren't getting > into the hairy corners of advanced indexing. > Right, the proposal was to have DeprecationWarnings when they differ, now I also thought DeprecationWarnings on two advanced indexes in general is good, because it is good for new users. I have to agree with your argument that most of the confused should be running into broadcast errors (if they expect oindex vs. fancy). So I see this as a point that we likely should just limit ourselves at least for now to the cases for example with sudden transposing going on. However, I would like to point out that the reason for the more broad warnings is that it could allow warping normal indexing at some point. Also it decreases traps with array-likes that behave differently. > Adding to the toolbox is great, but I don't see a good reason to take > out the ones that are commonly used quite safely. > > > and the idea was > > to have a method to get the old behaviour indefinitely. You will > > need > > to add the `.vindex`, but that should be the only code change > > needed, > > and it would be easy to find where with errors/warnings. > > It's not necessarily hard; it's just churn for no benefit to the > downstream code. 
They didn't get a new feature; they just have to run > faster to stay in the same place. > So, yes, it is annoying for quite a few projects that correctly use fancy indexing, but if we choose to not annoy you a little, we will have much less long term options which also includes such projects compatibility to new/current array-likes. So basically one point is: if we annoy scikit-image now, their code will work better for dask arrays in the future hopefully. - Sebastian > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at python.org > https://mail.python.org/mailman/listinfo/numpy-discussion -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 833 bytes Desc: This is a digitally signed message part URL: From sebastian at sipsolutions.net Tue Jun 26 04:41:24 2018 From: sebastian at sipsolutions.net (Sebastian Berg) Date: Tue, 26 Jun 2018 10:41:24 +0200 Subject: [Numpy-discussion] NEP 21: Simplified and explicit advanced indexing In-Reply-To: References: Message-ID: <7ce86e31ffa2ad107c4a77b35b9b817562c54193.camel@sipsolutions.net> On Tue, 2018-06-26 at 04:23 -0400, Hameer Abbasi wrote: > > Boolean indices are not supported. All indices must be integers, > integer arrays or slices. > > I would hope that there?s at least some way to do boolean indexing. I > often find myself needing it. I realise that > `arr.vindex[np.nonzero(boolean_idx)]` works, but it is slightly too > verbose for my liking. Maybe we can have `arr.bindex[boolean_index]` > as an alias to exactly that? > That part is limited to `vindex` only. A single boolean index would always work in plain indexing and you can mix it all up inside of `oindex`. But with fancy indexing mixing boolean + integer seems currently pretty much useless (and thus the same is true for `vindex`, in `oindex` things make sense). Now you could invent some new logic for such a mixing case in `vindex`, but it seems easier to just ignore it for the moment. - Sebastian > Or is boolean indexing preserved as-is n the newest proposal? If so, > great! > > Another thing I?d say is `arr.?index` should be replaced with > `arr.?idx`. I personally prefer `arr.?x` for my fingers but I realise > that for someone not super into NumPy indexing, this is kind of > opaque to read, so I propose this less verbose but hopefully equally > clear version, for my (and others?) brains. > > Best Regards, > Hameer Abbasi > Sent from Astro for Mac > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at python.org > https://mail.python.org/mailman/listinfo/numpy-discussion -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 833 bytes Desc: This is a digitally signed message part URL: From einstein.edison at gmail.com Tue Jun 26 04:48:22 2018 From: einstein.edison at gmail.com (Hameer Abbasi) Date: Tue, 26 Jun 2018 04:48:22 -0400 Subject: [Numpy-discussion] NEP 21: Simplified and explicit advanced indexing In-Reply-To: References: Message-ID: I would disagree here. For libraries like Dask, XArray, pydata/sparse, XND, etc., it would be bad for them if there was continued use of ?weird? indexing behaviour (no warnings means more code written that?s? well? not exactly the best design). Of course, we could just choose to not support it. But that means a lot of code won?t support us, or support us later than we desire. 
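On the boolean-indexing question raised a little earlier in the thread, the workaround and the suggested alias amount to the following (`.vindex` is still a proposal and `bindex` is only a suggestion, not an existing attribute):

import numpy as np

arr = np.arange(12).reshape(3, 4)
mask = arr > 5                     # a boolean index over the whole array

flat = arr[mask]                   # plain boolean indexing today, shape (6,)

# The same selection written with integer arrays, which is what
# arr.vindex[np.nonzero(mask)] would do under the NEP; a hypothetical
# arr.bindex[mask] would simply be shorthand for this.
rows, cols = np.nonzero(mask)
also_flat = arr[rows, cols]        # identical six elements
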
I agree with your design of ?let?s limit the number of warnings/deprecations to cases that make very little sense? but there should be warnings. Specifically, I recommend warnings for mixed slices and fancy indexes, and warnings followed by errors for cases where the transposing behaviour occurs. Best Regards, Hameer Abbasi Sent from Astro for Mac On 26. Jun 2018 at 10:33, Robert Kern wrote: On Tue, Jun 26, 2018 at 1:26 AM Travis Oliphant wrote: > I like the proposal generally. NumPy could use a good orthogonal indexing > method and a vectorized-indexing method is fine too. > > Robert Kern is spot on with his concerns as well. Please do not change > what arr[idx] does except to provide warnings and perhaps point people to > new .oix and .vix methods. What indexing does is documented (if hard to > understand and surprising in a particular sub-case). > > There is one specific place in the code where I would make a change to > raise an error rather than change the order of the axes of the output to > provide a consistent subspace. Even then, it should be done as a > deprecation warning and then raise the error. > > Otherwise, just add the new methods and don't make any other changes until > a major release. > I'd suggest that the NEP explicitly disclaim deprecating current behavior. Let the NEP just be about putting the new features out there. Once we have some experience with them for a year or three, then let's talk about deprecating parts of the current behavior and make a new NEP then if we want to go that route. We're only contemplating *long* deprecation cycles anyways; we're not in a race. The success of these new features doesn't really rely on the deprecation of current indexing, so let's separate those issues. -- Robert Kern _______________________________________________ NumPy-Discussion mailing list NumPy-Discussion at python.org https://mail.python.org/mailman/listinfo/numpy-discussion -------------- next part -------------- An HTML attachment was scrubbed... URL: From robert.kern at gmail.com Tue Jun 26 04:57:54 2018 From: robert.kern at gmail.com (Robert Kern) Date: Tue, 26 Jun 2018 01:57:54 -0700 Subject: [Numpy-discussion] NEP 21: Simplified and explicit advanced indexing In-Reply-To: References: Message-ID: On Tue, Jun 26, 2018 at 1:49 AM Hameer Abbasi wrote: > > On 26. Jun 2018 at 10:33, Robert Kern wrote: > > > I'd suggest that the NEP explicitly disclaim deprecating current behavior. Let the NEP just be about putting the new features out there. Once we have some experience with them for a year or three, then let's talk about deprecating parts of the current behavior and make a new NEP then if we want to go that route. We're only contemplating *long* deprecation cycles anyways; we're not in a race. The success of these new features doesn't really rely on the deprecation of current indexing, so let's separate those issues. > > I would disagree here. For libraries like Dask, XArray, pydata/sparse, XND, etc., it would be bad for them if there was continued use of ?weird? indexing behaviour (no warnings means more code written that?s? well? not exactly the best design). Of course, we could just choose to not support it. But that means a lot of code won?t support us, or support us later than we desire. > > I agree with your design of ?let?s limit the number of warnings/deprecations to cases that make very little sense? but there should be warnings. I'm still in favor of warnings in these cases. I didn't mean to suggest excluding those from the NEP. 
I just don't think they should be deprecations; we shouldn't suggest that they will eventually turn into errors. At least until we get these features out there, get some experience with them, then have a new NEP at that time just proposing deprecation. P.S. Would you mind bottom-posting? It helps maintain the context of what you are commenting on and my reply to those comments. I tried writing this reply without it, and it felt like it was missing context. Thanks! -- Robert Kern -------------- next part -------------- An HTML attachment was scrubbed... URL: From robert.kern at gmail.com Tue Jun 26 05:27:22 2018 From: robert.kern at gmail.com (Robert Kern) Date: Tue, 26 Jun 2018 02:27:22 -0700 Subject: [Numpy-discussion] NEP 21: Simplified and explicit advanced indexing In-Reply-To: <3aa8387021a6367c7a6227e424226904631bce60.camel@sipsolutions.net> References: <1529994253.102107.1420482720.1911BC65@webmail.messagingengine.com> <3aa8387021a6367c7a6227e424226904631bce60.camel@sipsolutions.net> Message-ID: On Tue, Jun 26, 2018 at 1:36 AM Sebastian Berg wrote: > On Tue, 2018-06-26 at 01:21 -0700, Robert Kern wrote: > > On Tue, Jun 26, 2018 at 12:58 AM Sebastian Berg > > wrote: > > > > > > > > > Yes, that is true, but I doubt you will find a lot of code path > > > that > > > need the current indexing as opposed to vindex here, > > > > That's probably true! But I think it's besides the point. I'd wager > > that most code paths that will use .vindex would work perfectly well > > with current indexing, too. Most of the time, people aren't getting > > into the hairy corners of advanced indexing. > > > > Right, the proposal was to have DeprecationWarnings when they differ, > now I also thought DeprecationWarnings on two advanced indexes in > general is good, because it is good for new users. > I have to agree with your argument that most of the confused should be > running into broadcast errors (if they expect oindex vs. fancy). So I > see this as a point that we likely should just limit ourselves at least > for now to the cases for example with sudden transposing going on. > > However, I would like to point out that the reason for the more broad > warnings is that it could allow warping normal indexing at some point. > I don't really understand this. You would discourage the "normal" syntax in favor of these more specific named syntaxes, so you can introduce different behavior for the "normal" syntax and encourage everyone to use it again? Just add more named syntaxes if you want new behavior! That's the beauty of the design underlying this NEP. > Also it decreases traps with array-likes that behave differently. > If we were to take this seriously, then no one should use a bare [] ever. I'll go on record as saying that array-likes should respond to `a[rr, cc]`, as in Juan's example, with the current behavior. And if they don't, they don't deserve to be operated on by skimage functions. If I'm reading the NEP correctly, the main thrust of the issue with array-likes is that it is difficult for some of them to implement the full spectrum of indexing possibilities. This NEP does not actually make it *easier* for those array-likes to implement every possibility. It just offers some APIs that more naturally express common use cases which can sometimes be implemented more naturally than if expressed in the current indexing. 
For instance, you can achieve the same effect as orthogonal indexing with the current implementation, but you have to manipulate the indices before you pass them over to __getitem__(), losing information along the way that could be used to make a more efficient lookup in some array-likes. The NEP design is essentially more of a way to give these array-likes standard places to raise NotImplementedError than it is to help them get rid of all of their NotImplementedErrors. More specifically, if these array-likes can't implement `a[rr, cc]`, they're not going to implement `a.vindex[rr, cc]`, either. I think most of the problems that caused these libraries to make different choices in their __getitem__() implementation are due to the fact that these expressive APIs didn't exist, so they had to shoehorn them into __getitem__(); orthogonal indexing was too useful and efficient not to implement! I think that once we have .oindex and .vindex out there, they will be able to clean up their __getitem__()s to consistently support whatever of the current behavior that they can and raise NotImplementedError where they can't. -- Robert Kern -------------- next part -------------- An HTML attachment was scrubbed... URL: From sebastian at sipsolutions.net Tue Jun 26 06:48:11 2018 From: sebastian at sipsolutions.net (Sebastian Berg) Date: Tue, 26 Jun 2018 12:48:11 +0200 Subject: [Numpy-discussion] NEP 21: Simplified and explicit advanced indexing In-Reply-To: References: <1529994253.102107.1420482720.1911BC65@webmail.messagingengine.com> <3aa8387021a6367c7a6227e424226904631bce60.camel@sipsolutions.net> Message-ID: On Tue, 2018-06-26 at 02:27 -0700, Robert Kern wrote: > On Tue, Jun 26, 2018 at 1:36 AM Sebastian Berg s.net> wrote: > > On Tue, 2018-06-26 at 01:21 -0700, Robert Kern wrote: > > > On Tue, Jun 26, 2018 at 12:58 AM Sebastian Berg > > > wrote: > > > > > > > > > > > > > > Yes, that is true, but I doubt you will find a lot of code path > > > > that > > > > need the current indexing as opposed to vindex here, > > > > > > That's probably true! But I think it's besides the point. I'd > > wager > > > that most code paths that will use .vindex would work perfectly > > well > > > with current indexing, too. Most of the time, people aren't > > getting > > > into the hairy corners of advanced indexing. > > > > > > > Right, the proposal was to have DeprecationWarnings when they > > differ, > > now I also thought DeprecationWarnings on two advanced indexes in > > general is good, because it is good for new users. > > I have to agree with your argument that most of the confused should > > be > > running into broadcast errors (if they expect oindex vs. fancy). So > > I > > see this as a point that we likely should just limit ourselves at > > least > > for now to the cases for example with sudden transposing going on. > > > > However, I would like to point out that the reason for the more > > broad > > warnings is that it could allow warping normal indexing at some > > point. > > > > I don't really understand this. You would discourage the "normal" > syntax in favor of these more specific named syntaxes, so you can > introduce different behavior for the "normal" syntax and encourage > everyone to use it again? Just add more named syntaxes if you want > new behavior! That's the beauty of the design underlying this NEP. > > > Also it decreases traps with array-likes that behave differently. > > If we were to take this seriously, then no one should use a bare [] > ever. 
> > I'll go on record as saying that array-likes should respond to `a[rr, > cc]`, as in Juan's example, with the current behavior. And if they > don't, they don't deserve to be operated on by skimage functions. > > If I'm reading the NEP correctly, the main thrust of the issue with > array-likes is that it is difficult for some of them to implement the > full spectrum of indexing possibilities. This NEP does not actually > make it *easier* for those array-likes to implement every > possibility. It just offers some APIs that more naturally express > common use cases which can sometimes be implemented more naturally > than if expressed in the current indexing. For instance, you can > achieve the same effect as orthogonal indexing with the current > implementation, but you have to manipulate the indices before you > pass them over to __getitem__(), losing information along the way > that could be used to make a more efficient lookup in some array- > likes. > > The NEP design is essentially more of a way to give these array-likes > standard places to raise NotImplementedError than it is to help them > get rid of all of their NotImplementedErrors. More specifically, if > these array-likes can't implement `a[rr, cc]`, they're not going to > implement `a.vindex[rr, cc]`, either. > > I think most of the problems that caused these libraries to make > different choices in their __getitem__() implementation are due to > the fact that these expressive APIs didn't exist, so they had to > shoehorn them into __getitem__(); orthogonal indexing was too useful > and efficient not to implement! I think that once we have .oindex and > .vindex out there, they will be able to clean up their __getitem__()s > to consistently support whatever of the current behavior that they > can and raise NotImplementedError where they can't. > Right, it helps mostly to be clear about what an object can and cannot do. So h5py or whatever could error out for plain indexing and only support `.oindex`, and we have all options cleanly available. And yes, I agree that in itself is a big step forward. The thing is there are also very strong opinions that the fancy indexing behaviour is so confusing that it would ideally not be the default since it breaks comparing analogy slice objects. So, personally, I would argue that if we were to start over from scratch, fancy indexing (multiple indexes), would not be the default plain indexing behaviour. Now, maybe the pain of a few warnings is too high, but if we wish to move, no matter how slowly, in such regard, we will have to swallow it eventually. The suggestion was to make that as easy as possible with adding an attribute indefinitely. Otherwise, even a possible numpy replacement might have difficulties to chose a different default for indexing for years to come... Practically, I guess some warnings might have to wait a longer while, just because it could be almost impossible to avoid them in code working with different numpy versions. - Sebastian > -- > Robert Kern > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at python.org > https://mail.python.org/mailman/listinfo/numpy-discussion -------------- next part -------------- A non-text attachment was scrubbed... 
Name: signature.asc Type: application/pgp-signature Size: 833 bytes Desc: This is a digitally signed message part URL: From kevin.k.sheppard at gmail.com Tue Jun 26 06:50:20 2018 From: kevin.k.sheppard at gmail.com (Kevin Sheppard) Date: Tue, 26 Jun 2018 10:50:20 +0000 Subject: [Numpy-discussion] NEP 21: Simplified and explicit advanced indexing In-Reply-To: References: Message-ID: This seems like progress and a clear method to outer indexing will help many users. As for names, I prefer .ox and .vx for shorthand of .oindex and .vindex. I don?t like the .ox_ or .o_ syntax. Before any deprecation warnings or any other warnings are added it would be helpful to have some way to set a flag on Python to show some sort of HiddenDeprecationWarning (or OnlyShowIfFlagPassesDeprecationWarning) that would automatically be filtered by default but could be shown if someone was interested. This will allow library writers to see problems before any start showing up for users. These could then be promoted to Visible or Future later. -------------- next part -------------- An HTML attachment was scrubbed... URL: From sebastian at sipsolutions.net Tue Jun 26 11:03:19 2018 From: sebastian at sipsolutions.net (Sebastian Berg) Date: Tue, 26 Jun 2018 17:03:19 +0200 Subject: [Numpy-discussion] NEP 21: Simplified and explicit advanced indexing In-Reply-To: References: Message-ID: On Tue, 2018-06-26 at 04:01 -0400, Hameer Abbasi wrote: > I second this design. If we were to consider the general case of a > tuple `idx`, then we?d not be moving forward at all. Design changes > would be impossible. I?d argue that this newer model would be easier > for library maintainers overall (who are the kind of people using > this), reducing maintenance cost in the long run because it?d lead to > simpler code. > > I would also that the ?internal? classes expressing outer as > vectorised indexing etc. should be exposed, for maintainers of duck > arrays to use. God knows how many utility functions I?ve had to write > to avoid relying on undocumented NumPy internals for pydata/sparse, > fearing that I?d have to rewrite/modify them when behaviour changes > or I find other corner cases. Could you list some examples what you would need? We can expose some of the internals, or maybe even provide funcs to map e.g. oindex to vindex or vindex to plain indexing, etc. but it would be helpful to know what downstream actually might need. For all I know the things that you are thinking of may not even exist... - Sebastian > > Best Regards, > Hameer Abbasi > Sent from Astro for Mac > > > On 26. Jun 2018 at 09:46, Robert Kern > > wrote: > > > > On Tue, Jun 26, 2018 at 12:13 AM Eric Wieser > il.com> wrote: > > > > I don't think it should be relegated to the "officially > > > discouraged" ghetto of `.legacy_index` > > > > > > The way I read it, the new spelling lof that would be the > > > explicit but not discouraged `image.vindex[rr, cc]`. > > > > > > > Okay, I missed that the first time through. I think having more > > self-contained descriptions of the semantics of each of these would > > be a good idea. The current description of `.vindex` spends more > > time talking about what it doesn't do, compared to the other > > methods, than what it does. > > > > Some more typical, less-exotic examples would be a good idea. > > > > > > I would reserve warnings for the cases where the current > > > behavior is something no one really wants, like mixing slices and > > > integer arrays. 
> > > > > > These are the cases that would only be available under > > > `legacy_index`. > > > > > > > I'm still leaning towards not warning on current, unproblematic > > common uses. It's unnecessary churn for currently working, > > understandable code. I would still reserve warnings and deprecation > > for the cases where the current behavior gives us something that no > > one wants. Those are the real traps that people need to be warned > > away from. > > > > If someone is mixing slices and integer indices, that's a really > > good sign that they thought indexing behaved in a different way > > (e.g. orthogonal indexing). > > > > If someone is just using multiple index arrays that would currently > > not give an error, that's actually a really good sign that they are > > using it correctly and are getting the semantics that they desired. > > If they wanted orthogonal indexing, it is *really* likely that > > their index arrays would *not* broadcast together. And even if they > > did, the wrong shape of the result is one of the more easily > > noticed things. These are not silent errors that would motivate > > adding a new warning. > > > > -- > > Robert Kern > > > > _______________________________________________ > > NumPy-Discussion mailing list > > NumPy-Discussion at python.org > > https://mail.python.org/mailman/listinfo/numpy-discussion > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at python.org > https://mail.python.org/mailman/listinfo/numpy-discussion -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 833 bytes Desc: This is a digitally signed message part URL: From wieser.eric+numpy at gmail.com Tue Jun 26 12:36:39 2018 From: wieser.eric+numpy at gmail.com (Eric Wieser) Date: Tue, 26 Jun 2018 09:36:39 -0700 Subject: [Numpy-discussion] NEP 21: Simplified and explicit advanced indexing In-Reply-To: References: Message-ID: We can expose some of the internals These could be expressed as methods on the internal indexing objects I proposed in the first reply to this thread, which has seen no responses. I think Hameer Abbasi is looking for something like OrthogonalIndexer(...).to_vindex() -> VectorizedIndexer such that arr.oindex[ind] selects the same elements as arr.vindex[OrthogonalIndexer(ind).to_vindex()] Eric ? On Tue, 26 Jun 2018 at 08:04 Sebastian Berg wrote: > On Tue, 2018-06-26 at 04:01 -0400, Hameer Abbasi wrote: > > I second this design. If we were to consider the general case of a > > tuple `idx`, then we?d not be moving forward at all. Design changes > > would be impossible. I?d argue that this newer model would be easier > > for library maintainers overall (who are the kind of people using > > this), reducing maintenance cost in the long run because it?d lead to > > simpler code. > > > > I would also that the ?internal? classes expressing outer as > > vectorised indexing etc. should be exposed, for maintainers of duck > > arrays to use. God knows how many utility functions I?ve had to write > > to avoid relying on undocumented NumPy internals for pydata/sparse, > > fearing that I?d have to rewrite/modify them when behaviour changes > > or I find other corner cases. > > Could you list some examples what you would need? We can expose some of > the internals, or maybe even provide funcs to map e.g. oindex to vindex > or vindex to plain indexing, etc. but it would be helpful to know what > downstream actually might need. 
For all I know the things that you are > thinking of may not even exist... > > - Sebastian > > > > > > > Best Regards, > > Hameer Abbasi > > Sent from Astro for Mac > > > > > On 26. Jun 2018 at 09:46, Robert Kern > > > wrote: > > > > > > On Tue, Jun 26, 2018 at 12:13 AM Eric Wieser > > il.com> wrote: > > > > > I don't think it should be relegated to the "officially > > > > discouraged" ghetto of `.legacy_index` > > > > > > > > The way I read it, the new spelling lof that would be the > > > > explicit but not discouraged `image.vindex[rr, cc]`. > > > > > > > > > > Okay, I missed that the first time through. I think having more > > > self-contained descriptions of the semantics of each of these would > > > be a good idea. The current description of `.vindex` spends more > > > time talking about what it doesn't do, compared to the other > > > methods, than what it does. > > > > > > Some more typical, less-exotic examples would be a good idea. > > > > > > > > I would reserve warnings for the cases where the current > > > > behavior is something no one really wants, like mixing slices and > > > > integer arrays. > > > > > > > > These are the cases that would only be available under > > > > `legacy_index`. > > > > > > > > > > I'm still leaning towards not warning on current, unproblematic > > > common uses. It's unnecessary churn for currently working, > > > understandable code. I would still reserve warnings and deprecation > > > for the cases where the current behavior gives us something that no > > > one wants. Those are the real traps that people need to be warned > > > away from. > > > > > > If someone is mixing slices and integer indices, that's a really > > > good sign that they thought indexing behaved in a different way > > > (e.g. orthogonal indexing). > > > > > > If someone is just using multiple index arrays that would currently > > > not give an error, that's actually a really good sign that they are > > > using it correctly and are getting the semantics that they desired. > > > If they wanted orthogonal indexing, it is *really* likely that > > > their index arrays would *not* broadcast together. And even if they > > > did, the wrong shape of the result is one of the more easily > > > noticed things. These are not silent errors that would motivate > > > adding a new warning. > > > > > > -- > > > Robert Kern > > > > > > _______________________________________________ > > > NumPy-Discussion mailing list > > > NumPy-Discussion at python.org > > > https://mail.python.org/mailman/listinfo/numpy-discussion > > > > _______________________________________________ > > NumPy-Discussion mailing list > > NumPy-Discussion at python.org > > https://mail.python.org/mailman/listinfo/numpy-discussion > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at python.org > https://mail.python.org/mailman/listinfo/numpy-discussion > -------------- next part -------------- An HTML attachment was scrubbed... URL: From m.h.vankerkwijk at gmail.com Tue Jun 26 14:25:25 2018 From: m.h.vankerkwijk at gmail.com (Marten van Kerkwijk) Date: Tue, 26 Jun 2018 14:25:25 -0400 Subject: [Numpy-discussion] Allowing broadcasting of code dimensions in generalized ufuncs In-Reply-To: References: Message-ID: Hi All, Matti asked me to make a PR accepting my own NEP - https://github.com/numpy/numpy/pull/11429 Any objections? 
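For those who did not follow the earlier thread: the centrepiece of the proposal is letting a gufunc signature mark dimensions as optional, so that matmul can finally be expressed as a single gufunc. If I remember the proposed notation correctly, its signature would read

    (n?,k),(k,m?)->(n?,m?)

where a dimension marked with "?" may be absent from an input and is then also dropped from the output, so the one signature covers matrix @ matrix, matrix @ vector, vector @ matrix and vector @ vector.
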
As noted in my earlier summary of the discussion, in principle we can choose to accept only parts, although I think it became clear that the most contentious is also the one arguably most needed, the flexible dimensions for matmul. Moving forward has the advantage that in 1.16 we will actually be able to deal with matmul. All the best, Marten On Fri, Jun 15, 2018 at 2:17 PM, Stephan Hoyer wrote: > On Mon, Jun 11, 2018 at 11:59 PM Eric Wieser > wrote: > >> I don?t understand your alternative here. If we overload np.matmul using >> *array_function*, then it would not use *ether* of these options for >> writing the operation in terms of other gufuncs. It would simply look for >> an *array_function* attribute, and call that method instead. >> >> Let me explain that suggestion a little more clearly. >> >> 1. There?d be a linalg.matmul2d that performs the real matrix case, >> which would be easy to make as a ufunc right now. >> 2. __matmul__ and __rmatmul__ would just call np.matmul, as they >> currently do (for consistency between np.matmul and operator.matmul, >> needed in python pre- at -operator) >> 3. np.matmul would be implemented as: >> >> @do_array_function_overridesdef matmul(a, b): >> if a.ndim != 1 and b.ndim != 1: >> return matmul2d(a, b) >> elif a.ndim != 1: >> return matmul2d(a, b[:,None])[...,0] >> elif b.ndim != 1: >> return matmul2d(a[None,:], b) >> else: >> # this one probably deserves its own ufunf >> return matmul2d(a[None,:], b[:,None])[0,0] >> >> 4. Quantity can just override __array_ufunc__ as with any other ufunc >> 5. DataArray, knowing the above doesn?t work, would implement >> something like >> >> @matmul.register_array_function(DataArray)def __array_function__(a, b): >> if a.ndim != 1 and b.ndim != 1: >> return matmul2d(a, b) >> else: >> # either: >> # - add/remove dummy dimensions in a dataarray-specific way >> # - downcast to ndarray and do the dimension juggling there >> >> >> Advantages of this approach: >> >> - >> >> Neither the ufunc machinery, nor __array_ufunc__, nor the inner loop, >> need to know about optional dimensions. >> - >> >> We get a matmul2d ufunc, that all subclasses support out of the box >> if they support matmul >> >> Eric >> > OK, this sounds pretty reasonable to me -- assuming we manage to figure > out the __array_function__ proposal! > > There's one additional ingredient we would need to make this work well: > some way to guarantee that "ndim" and indexing operations are available > without casting to a base numpy array. > > For now, np.asanyarray() would probably suffice, but that isn't quite > right (e.g., this would fail for np.matrix). > > In the long term, I think we need a new coercion protocol for "duck" > arrays. Nathaniel Smith and I started writing a NEP on this, but it isn't > quite ready yet. > >> ? >> > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at python.org > https://mail.python.org/mailman/listinfo/numpy-discussion > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From matti.picus at gmail.com Tue Jun 26 17:43:30 2018 From: matti.picus at gmail.com (Matti Picus) Date: Tue, 26 Jun 2018 14:43:30 -0700 Subject: [Numpy-discussion] rackspace ssl certificates In-Reply-To: References: Message-ID: On 19/06/18 10:57, Matthew Brett wrote: > Hi, > > On Tue, Jun 19, 2018 at 6:27 PM, Matti Picus wrote: >> On 19/06/18 09:58, Charles R Harris wrote: >>>> What I was curious about is that there were no more "daily" builds of >>>> master. 
>>> Is that right? That there were daily builds of master, on Appveyor? >>> I don't know how those worked, I only recently got cron permission ... >> >> No, but there used to be daily builds on travis. They stopped 8 days ago, >> https://travis-ci.org/MacPython/numpy-wheels/builds. > Oops - yes - sorry - I retired the 'daily' branch, in favor of > 'master', but forgot to update the Travis-CI settings. > > Done now. > > Cheers, > > Matthew > FWIW, still no daily builds at https://travis-ci.org/MacPython/numpy-wheels/builds Matti From matthew.brett at gmail.com Tue Jun 26 17:55:13 2018 From: matthew.brett at gmail.com (Matthew Brett) Date: Tue, 26 Jun 2018 22:55:13 +0100 Subject: [Numpy-discussion] rackspace ssl certificates In-Reply-To: References: Message-ID: Hi, On Tue, Jun 26, 2018 at 10:43 PM, Matti Picus wrote: > On 19/06/18 10:57, Matthew Brett wrote: >> >> Hi, >> >> On Tue, Jun 19, 2018 at 6:27 PM, Matti Picus >> wrote: >>> >>> On 19/06/18 09:58, Charles R Harris wrote: >>>>> >>>>> What I was curious about is that there were no more "daily" builds of >>>>> master. >>>> >>>> Is that right? That there were daily builds of master, on Appveyor? >>>> I don't know how those worked, I only recently got cron permission ... >>> >>> >>> No, but there used to be daily builds on travis. They stopped 8 days ago, >>> https://travis-ci.org/MacPython/numpy-wheels/builds. >> >> Oops - yes - sorry - I retired the 'daily' branch, in favor of >> 'master', but forgot to update the Travis-CI settings. >> >> Done now. >> >> Cheers, >> >> Matthew >> > FWIW, still no daily builds at > https://travis-ci.org/MacPython/numpy-wheels/builds You mean, some days there appears to be no build? The build matrix does show Cron-triggered jobs, the last of which was a few hours ago: https://travis-ci.org/MacPython/numpy-wheels/builds/397008012 Cheers, Matthew From robert.kern at gmail.com Tue Jun 26 19:32:21 2018 From: robert.kern at gmail.com (Robert Kern) Date: Tue, 26 Jun 2018 16:32:21 -0700 Subject: [Numpy-discussion] NEP 21: Simplified and explicit advanced indexing In-Reply-To: References: <1529994253.102107.1420482720.1911BC65@webmail.messagingengine.com> <3aa8387021a6367c7a6227e424226904631bce60.camel@sipsolutions.net> Message-ID: On Tue, Jun 26, 2018 at 3:50 AM Sebastian Berg wrote: > > On Tue, 2018-06-26 at 02:27 -0700, Robert Kern wrote: > > On Tue, Jun 26, 2018 at 1:36 AM Sebastian Berg > s.net> wrote: > > > On Tue, 2018-06-26 at 01:21 -0700, Robert Kern wrote: > > > > On Tue, Jun 26, 2018 at 12:58 AM Sebastian Berg > > > > wrote: > > > > > > > > > > > > > > > > > > > Yes, that is true, but I doubt you will find a lot of code path > > > > > that > > > > > need the current indexing as opposed to vindex here, > > > > > > > > That's probably true! But I think it's besides the point. I'd > > > wager > > > > that most code paths that will use .vindex would work perfectly > > > well > > > > with current indexing, too. Most of the time, people aren't > > > getting > > > > into the hairy corners of advanced indexing. > > > > > > > > > > Right, the proposal was to have DeprecationWarnings when they > > > differ, > > > now I also thought DeprecationWarnings on two advanced indexes in > > > general is good, because it is good for new users. > > > I have to agree with your argument that most of the confused should > > > be > > > running into broadcast errors (if they expect oindex vs. fancy). 
So > > > I > > > see this as a point that we likely should just limit ourselves at > > > least > > > for now to the cases for example with sudden transposing going on. > > > > > > However, I would like to point out that the reason for the more > > > broad > > > warnings is that it could allow warping normal indexing at some > > > point. > > > > > > > I don't really understand this. You would discourage the "normal" > > syntax in favor of these more specific named syntaxes, so you can > > introduce different behavior for the "normal" syntax and encourage > > everyone to use it again? Just add more named syntaxes if you want > > new behavior! That's the beauty of the design underlying this NEP. > > > > > Also it decreases traps with array-likes that behave differently. > > > > If we were to take this seriously, then no one should use a bare [] > > ever. > > > > I'll go on record as saying that array-likes should respond to `a[rr, > > cc]`, as in Juan's example, with the current behavior. And if they > > don't, they don't deserve to be operated on by skimage functions. > > > > If I'm reading the NEP correctly, the main thrust of the issue with > > array-likes is that it is difficult for some of them to implement the > > full spectrum of indexing possibilities. This NEP does not actually > > make it *easier* for those array-likes to implement every > > possibility. It just offers some APIs that more naturally express > > common use cases which can sometimes be implemented more naturally > > than if expressed in the current indexing. For instance, you can > > achieve the same effect as orthogonal indexing with the current > > implementation, but you have to manipulate the indices before you > > pass them over to __getitem__(), losing information along the way > > that could be used to make a more efficient lookup in some array- > > likes. > > > > The NEP design is essentially more of a way to give these array-likes > > standard places to raise NotImplementedError than it is to help them > > get rid of all of their NotImplementedErrors. More specifically, if > > these array-likes can't implement `a[rr, cc]`, they're not going to > > implement `a.vindex[rr, cc]`, either. > > > > I think most of the problems that caused these libraries to make > > different choices in their __getitem__() implementation are due to > > the fact that these expressive APIs didn't exist, so they had to > > shoehorn them into __getitem__(); orthogonal indexing was too useful > > and efficient not to implement! I think that once we have .oindex and > > .vindex out there, they will be able to clean up their __getitem__()s > > to consistently support whatever of the current behavior that they > > can and raise NotImplementedError where they can't. > > > > Right, it helps mostly to be clear about what an object can and cannot > do. So h5py or whatever could error out for plain indexing and only > support `.oindex`, and we have all options cleanly available. > > And yes, I agree that in itself is a big step forward. Okay, great. Before we move on to your next point, can we agree that the array-likes aren't a motivating factor for deprecating the current behavior of __getitem__()? > The thing is there are also very strong opinions that the fancy > indexing behaviour is so confusing that it would ideally not be the > default since it breaks comparing analogy slice objects. 
> > So, personally, I would argue that if we were to start over from > scratch, fancy indexing (multiple indexes), would not be the default > plain indexing behaviour. > Now, maybe the pain of a few warnings is too high, but if we wish to > move, no matter how slowly, in such regard, we will have to swallow it > eventually. > The suggestion was to make that as easy as possible with adding an > attribute indefinitely. > Otherwise, even a possible numpy replacement might have difficulties to > chose a different default for indexing for years to come... So I think we've moved past the technical objections. In the post-NEP .oindex/.vindex order, everyone can get the behavior that they want. Your argument for deprecation is now just about what the default is, the semantics that get pride of place with the shortest spelling. I am sympathetic to the feeling like you wish you had a time machine to go fix a design with your new insight. But it seems to me that just changing which semantics are the default has relatively attenuated value while breaking compatibility for a fundamental feature of numpy has significant costs. Just introducing .oindex is the bulk of the value of this NEP. Everything else is window dressing. You have my sympathies, but not enough for me to consent to deprecation. You might get more of my sympathy a year or two from now when the community has had a chance to work with .oindex. It's entirely possible that everyone will leap to using .oindex (and .vindex only rarely), and we will be flooded with complaints that "I only use .oindex, but the name is so long it messes up the readability of my lengthy expressions". But it's also possible that it sort of fizzles: people use it, but maybe use .vindex more, or about the same. Or just keep on happily using neither. We don't know which of those futures are going to be true. Anecdatally, you want .oindex semantics most often; I would almost exclusively use .vindex. I don't know which of us is more representative. Probably neither. I maintain that considering deprecation is premature at this time. Please take it out of this NEP. Let us get a feel for how people actually use .oindex/.vindex. Then we can talk about deprecation. This NEP gets my enthusiastic approval, except for the deprecation. I will be happy to talk about deprecation with an open mind in a few years. With some more actual experience under our belt, rather than prediction and theory, we can be more confident about the approach we want to take. Deprecation is not a fundamental part of this NEP and can be decided independently at a later time. -- Robert Kern -------------- next part -------------- An HTML attachment was scrubbed... URL: From shoyer at gmail.com Tue Jun 26 21:13:25 2018 From: shoyer at gmail.com (Stephan Hoyer) Date: Tue, 26 Jun 2018 18:13:25 -0700 Subject: [Numpy-discussion] NEP 21: Simplified and explicit advanced indexing In-Reply-To: References: <1529994253.102107.1420482720.1911BC65@webmail.messagingengine.com> <3aa8387021a6367c7a6227e424226904631bce60.camel@sipsolutions.net> Message-ID: On Tue, Jun 26, 2018 at 4:34 PM Robert Kern wrote: > I maintain that considering deprecation is premature at this time. Please > take it out of this NEP. Let us get a feel for how people actually use > .oindex/.vindex. Then we can talk about deprecation. This NEP gets my > enthusiastic approval, except for the deprecation. I will be happy to talk > about deprecation with an open mind in a few years. 
With some more actual > experience under our belt, rather than prediction and theory, we can be > more confident about the approach we want to take. Deprecation is not a > fundamental part of this NEP and can be decided independently at a later > time. > I agree, we should scale back most of the deprecations proposed in this NEP, leaving them for possible future work. In particular, you're not convinced yet that "outer indexing" is a more intuitive default indexing mode than "vectorized indexing", so it is premature to deprecate vectorized indexing behavior that conflicts with outer indexing. OK, fair enough. I would still like to include at least two more limited forms of deprecation that I hope will be less controversial: - Mixed boolean/integer array indexing. This is neither intuitive nor useful, and I don't think I've ever seen it used. Usually "outer indexing" behavior is what is desired here. - Mixed array/slice indexing, for cases with arrays separated by slices so NumPy can't do the "intuitive" transpose on the output. As noted in the NEP, this is a common source of bugs. Users who want this should really switch to vindex. In the long term, although I agree with Sebastian that "outer indexing" is more intuitive for default indexing behavior, I would really like to eliminate the "dimension reordering" behavior of mixed array/slice indexing altogether. This is a weird special case that behaves differently for array[...] and array.vindex[...]. So if we don't choose to deprecate all cases where [] and oindex[] are different, I would at least like to deprecate all cases where [] and vindex[] are different. -------------- next part -------------- An HTML attachment was scrubbed... URL: From shoyer at gmail.com Tue Jun 26 21:22:24 2018 From: shoyer at gmail.com (Stephan Hoyer) Date: Tue, 26 Jun 2018 18:22:24 -0700 Subject: [Numpy-discussion] NEP 21: Simplified and explicit advanced indexing In-Reply-To: References: Message-ID: On Tue, Jun 26, 2018 at 9:38 AM Eric Wieser wrote: > We can expose some of the internals > > These could be expressed as methods on the internal indexing objects I > proposed in the first reply to this thread, which has seen no responses. > > I think Hameer Abbasi is looking for something like OrthogonalIndexer(...).to_vindex() > -> VectorizedIndexer such that arr.oindex[ind] selects the same elements > as arr.vindex[OrthogonalIndexer(ind).to_vindex()] > > Eric > It is probably worth noting that xarray already uses very similar classes internally for keeping track of indexing operations. See BasicIndexer, OuterIndexer and VectorizedIndexer: https://github.com/pydata/xarray/blob/v0.10.7/xarray/core/indexing.py#L295-L428 This turns out to be a pretty convenient model even when not using subclassing. In xarray, we use them internally in various "partial duck array" classes that do some lazy computation upon indexing with __getitem__. It's nice to simply be able to forward on Indexer objects rather than implement separate vindex/oindex methods. We also have utility functions for converting between different forms, e.g., from OuterIndexer to VectorizedIndexer: https://github.com/pydata/xarray/blob/v0.10.7/xarray/core/indexing.py#L654 I guess this is a case for using such classes internally in NumPy, and possibly for exposing them publicly as well. -------------- next part -------------- An HTML attachment was scrubbed... 
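For concreteness, the OuterIndexer -> VectorizedIndexer conversion mentioned above can be sketched with plain NumPy; the helper name below is invented for illustration and this is not xarray's actual implementation:

import numpy as np

def outer_to_vectorized(key):
    # Convert a tuple of 1-D integer arrays (an outer/orthogonal index)
    # into the equivalent broadcast ("vectorized") index.  For the
    # all-array case this is exactly what np.ix_ does.
    return np.ix_(*(np.asarray(k) for k in key))

arr = np.arange(12).reshape(3, 4)
rows, cols = np.array([0, 2]), np.array([1, 3])
arr[outer_to_vectorized((rows, cols))].shape   # (2, 2) -- outer selection
arr[rows, cols].shape                          # (2,)   -- vectorized selection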
URL: From robert.kern at gmail.com Tue Jun 26 21:38:44 2018 From: robert.kern at gmail.com (Robert Kern) Date: Tue, 26 Jun 2018 18:38:44 -0700 Subject: [Numpy-discussion] NEP 21: Simplified and explicit advanced indexing In-Reply-To: References: <1529994253.102107.1420482720.1911BC65@webmail.messagingengine.com> <3aa8387021a6367c7a6227e424226904631bce60.camel@sipsolutions.net> Message-ID: On Tue, Jun 26, 2018 at 6:14 PM Stephan Hoyer wrote: > On Tue, Jun 26, 2018 at 4:34 PM Robert Kern wrote: > >> I maintain that considering deprecation is premature at this time. Please >> take it out of this NEP. Let us get a feel for how people actually use >> .oindex/.vindex. Then we can talk about deprecation. This NEP gets my >> enthusiastic approval, except for the deprecation. I will be happy to talk >> about deprecation with an open mind in a few years. With some more actual >> experience under our belt, rather than prediction and theory, we can be >> more confident about the approach we want to take. Deprecation is not a >> fundamental part of this NEP and can be decided independently at a later >> time. >> > > I agree, we should scale back most of the deprecations proposed in this > NEP, leaving them for possible future work. In particular, you're not > convinced yet that "outer indexing" is a more intuitive default indexing > mode than "vectorized indexing", so it is premature to deprecate vectorized > indexing behavior that conflicts with outer indexing. OK, fair enough. > Actually, I do think outer indexing is more "intuitive"*, as far as that goes. It's just rarely what I actually want to accomplish. * I do not like using "intuitive" in programming. Nipples are intuitive. Everything else is learned. But in this case, I think that outer indexing is a more concordant extension of the concepts that a new numpy user would have learned earlier: integer indices and slices. I would still like to include at least two more limited form of deprecation > that I hope will be less controversial: > - Mixed boolean/integer array indexing. This is not very intuitive nor > useful, and I don't think I've ever seen it used. Usually "outer indexing" > behavior is what is desired here. > - Mixed array/slice indexing, for cases with arrays separated by slices so > NumPy can't do the "intuitive" transpose on the output. As noted in the > NEP, this is a common source of bugs. Users who want this should really > switch to vindex. > I'd still prefer not talking deprecation, per se, in this NEP (but my objection is weaker). I would definitely start adding in informative, noisy warnings in these cases, though. Along the lines of, "Hey, this is a dodgy construction that typically gives unexpected results. Here are .oindex/.vindex that might do what you actually want, but you can use .legacy_index if you just want to silence this warning". Rather than "Hey, this is going to go away at some point." -- Robert Kern -------------- next part -------------- An HTML attachment was scrubbed... URL: From shoyer at gmail.com Tue Jun 26 21:45:49 2018 From: shoyer at gmail.com (Stephan Hoyer) Date: Tue, 26 Jun 2018 18:45:49 -0700 Subject: [Numpy-discussion] NEP 21: Simplified and explicit advanced indexing In-Reply-To: References: <1529994253.102107.1420482720.1911BC65@webmail.messagingengine.com> <3aa8387021a6367c7a6227e424226904631bce60.camel@sipsolutions.net> Message-ID: On Tue, Jun 26, 2018 at 6:39 PM Robert Kern wrote: > I'd still prefer not talking deprecation, per se, in this NEP (but my > objection is weaker). 
I would definitely start adding in informative, noisy > warnings in these cases, though. Along the lines of, "Hey, this is a dodgy > construction that typically gives unexpected results. Here are > .oindex/.vindex that might do what you actually want, but you can use > .legacy_index if you just want to silence this warning". Rather than "Hey, > this is going to go away at some point." > Yes, agreed. These will use a new warning class, perhaps numpy.IndexingWarning. -------------- next part -------------- An HTML attachment was scrubbed... URL: From robert.kern at gmail.com Tue Jun 26 21:54:14 2018 From: robert.kern at gmail.com (Robert Kern) Date: Tue, 26 Jun 2018 18:54:14 -0700 Subject: [Numpy-discussion] NEP 21: Simplified and explicit advanced indexing In-Reply-To: References: <1529994253.102107.1420482720.1911BC65@webmail.messagingengine.com> <3aa8387021a6367c7a6227e424226904631bce60.camel@sipsolutions.net> Message-ID: On Tue, Jun 26, 2018 at 6:47 PM Stephan Hoyer wrote: > > On Tue, Jun 26, 2018 at 6:39 PM Robert Kern wrote: > >> I'd still prefer not talking deprecation, per se, in this NEP (but my >> objection is weaker). I would definitely start adding in informative, noisy >> warnings in these cases, though. Along the lines of, "Hey, this is a dodgy >> construction that typically gives unexpected results. Here are >> .oindex/.vindex that might do what you actually want, but you can use >> .legacy_index if you just want to silence this warning". Rather than "Hey, >> this is going to go away at some point." >> > > Yes, agreed. These will use a new warning class, perhaps > numpy.IndexingWarning. > Perfect. -- Robert Kern -------------- next part -------------- An HTML attachment was scrubbed... URL: From shoyer at gmail.com Wed Jun 27 00:48:40 2018 From: shoyer at gmail.com (Stephan Hoyer) Date: Tue, 26 Jun 2018 21:48:40 -0700 Subject: [Numpy-discussion] NEP 21: Simplified and explicit advanced indexing In-Reply-To: References: <1529994253.102107.1420482720.1911BC65@webmail.messagingengine.com> Message-ID: On Tue, Jun 26, 2018 at 12:46 AM Robert Kern wrote: > I think having more self-contained descriptions of the semantics of each > of these would be a good idea. The current description of `.vindex` spends > more time talking about what it doesn't do, compared to the other methods, > than what it does. > Will do. > I'm still leaning towards not warning on current, unproblematic common > uses. It's unnecessary churn for currently working, understandable code. I > would still reserve warnings and deprecation for the cases where the > current behavior gives us something that no one wants. Those are the real > traps that people need to be warned away from. > > If someone is mixing slices and integer indices, that's a really good sign > that they thought indexing behaved in a different way (e.g. orthogonal > indexing). > I agree, but I'm still not entirely sure where to draw the line on behavior that should issue a warning. Some options, in roughly descending order of severity: 1. Warn if [] would give a different result than .oindex[]. This is the current proposal in the NEP, but based on the feedback we should hold back on it for now. 2. Warn if there is a mixture of arrays/slice objects in indices for [], even implicitly (e.g., including arr[idx] when is equivalent to arr[idx, :]). In this case, indices end up at the end both for legacy_index and vindex, but arguably that is only a happy coincidence. 3. Warn if [] would give a different result from .vindex[]. 
This is a little weaker than the previous condition, because arr[idx, :] or arr[idx, ...] would not give a warning. However, cases like arr[..., idx] or arr[:, idx, :] would still start to give warnings, even though they are arguably well defined according to either outer indexing (if idx.ndim == 1) or legacy indexing (due to dimension reordering rules that will be omitted from vindex). 4. Warn if there are multiple arrays/integer indices separated by a slice object, e.g., arr[idx1, :, idx2]. This is the edge case that really trips up users. As I said in my other response, in the long term, I would prefer to either (a) drop support for vectorized indexing in [] or (b) if we stick with supporting vectorized indexing in [], at least ensure consistent dimension ordering rules for [] and vindex[]. That would suggest using either my proposed rule 2 or 3. I also agree with you that anyone mixing slices and integers probably is confused about how indexing works, at least in edge cases. But given the lengths that legacy indexing goes to to support "outer indexing-like" behavior in the common case of a single integer array and many slices, I am hesitant to start warning in this case. The result of arr[..., idx, :] is relatively easy to understand, even though it uses its own set of rules, which happen to be more consistent with oindex[] than vindex[]. We certainly could make the conservative choice of only adopting 4 for now and leaving further cleanup for later. I guess this uncertainty about whether direct indexing should be more like vindex[] or oindex[] in the long term is a good argument for holding off on other warnings for now. But I think we are almost certainly going to want to make further warnings/deprecations of some form. -------------- next part -------------- An HTML attachment was scrubbed... URL: From jni.soma at gmail.com Wed Jun 27 01:19:58 2018 From: jni.soma at gmail.com (Juan Nunez-Iglesias) Date: Wed, 27 Jun 2018 15:19:58 +1000 Subject: [Numpy-discussion] NEP 21: Simplified and explicit advanced indexing In-Reply-To: References: <1529994253.102107.1420482720.1911BC65@webmail.messagingengine.com> <3aa8387021a6367c7a6227e424226904631bce60.camel@sipsolutions.net> Message-ID: <1530076798.4187189.1421749616.4AC83CBA@webmail.messagingengine.com> Let me start by thanking Robert for articulating my viewpoints far better than I could have done myself. I want to explicitly flag the following statements for endorsement: > *I would still reserve warnings and deprecation for the cases where > the current behavior gives us something that no one wants. Those are > the real traps that people need to be warned away from.* > *In the post-NEP .oindex/.vindex order, everyone can get the behavior > that they want. Your argument for deprecation is now just about what > the default is, the semantics that get pride of place with the > shortest spelling. I am sympathetic to the feeling like you wish you > had a time machine to go fix a design with your new insight. But it > seems to me that just changing which semantics are the default has > relatively attenuated value while breaking compatibility for a > fundamental feature of numpy has significant costs. Just introducing > .oindex is the bulk of the value of this NEP. Everything else is > window dressing.* > *If someone is mixing slices and integer indices, that's a really good > sign that they thought indexing behaved in a different way (e.g. 
> orthogonal indexing).* I would offer the exception of trailing slices to this statement, though: In [1]: from skimage import data In [2]: astro = data.astronaut() In [3]: astro.shape Out[3]: (512, 512, 3) In [4]: rr, cc = np.array([1, 3, 3, 3]), np.array([1, 8, 9, 10]) In [5]: astro[rr, cc].shape Out[5]: (4, 3) In [6]: astro[rr, cc, :].shape Out[6]: (4, 3) This does exactly what I would expect. Going back to the motivation for the NEP, I think this bit, emphasis mine, is crucial: >> the existing rules for advanced indexing with multiple array indices >> are typically confusing to both new, **and in many cases even old,** >> users of NumPy I think it is ok for advanced indexing to be accessible to advanced users. I remember that it took me quite a while to grok NumPy advanced indexing, but once I did I just loved it. I also like that this syntax translates perfectly from integer indices to float coordinates in `ndimage.map_coordinates`. > *I'll go on record as saying that array-likes should respond to `a[rr, > cc]`, as in Juan's example, with the current behavior. And if they > don't, they don't deserve to be operated on by skimage functions.** * (I don't think of us highly enough to use the word "deserve", but I would say that we would hesitate to support arrays that don't use this convention.) > *They didn't get a new feature; they just have to run faster to stay > in the same place.** * It is also probably true, as mentioned elsewhere, that we could go through our entire codebase and append `.vidx` to every array indexing op. Perhaps others on this list find this a reasonable request, but I don't. Aside from the churn involved, it would make our codebase significantly uglier and less readable. I should also emphasise that NumPy is really *the* foundational project for the entire Scientific Python ecosystem. Changing the meaning of [] should only be considered if it delivers an *extreme* benefit. Robert's statement would apply to a stupid number of projects. > *Once we have some experience with them for a year or three, then > let's talk about deprecating parts of the current behavior and make a > new NEP then if we want to go that route.** * :+10**6: To Sebastian's comment: > if we choose to not annoy you a little, we will > have much less long term options which also includes such projects > compatibility to new/current array-likes. > So basically one point is: if we annoy scikit-image now, their code > will work better for dask arrays in the future hopefully. Let's get rid of the hopefully. Let NumPy implement .oindex and .vindex. Let Dask arrays do the same. Let's have an announcement on the scikit-image mailing list, "hey guys, if you switch all your indexing operations to .vindex, suddenly all of your library works with dask arrays!" At that point, we have a value proposition on our hands. Currently, it amounts to gambling with others' time. To Stephan's options that were sent while I was composing this: > Some options, in roughly descending order of severity: I favour 4, or at the limit 3. (See use case above, which I would argue is totally unsurprising.) I'm happy that option 1 appears to be off the table. Hameer, > For libraries like Dask, XArray, pydata/sparse, XND, etc., it would be > bad for them if there was continued use of ?weird? indexing behaviour > (no warnings means more code written that?s? well? not exactly the > best design). Again, I think libraries should support the simple/not unintuitive vindex cases. This is not bad design. 
> *We don't know which of those futures are going to be true. > Anecdatally, you want .oindex semantics most often; I would almost > exclusively use .vindex. I don't know which of us is more > representative.* Same. -------------- next part -------------- An HTML attachment was scrubbed... URL: From robert.kern at gmail.com Wed Jun 27 01:21:49 2018 From: robert.kern at gmail.com (Robert Kern) Date: Tue, 26 Jun 2018 22:21:49 -0700 Subject: [Numpy-discussion] NEP 21: Simplified and explicit advanced indexing In-Reply-To: References: <1529994253.102107.1420482720.1911BC65@webmail.messagingengine.com> Message-ID: On Tue, Jun 26, 2018 at 9:50 PM Stephan Hoyer wrote: > On Tue, Jun 26, 2018 at 12:46 AM Robert Kern > wrote: > >> I think having more self-contained descriptions of the semantics of each >> of these would be a good idea. The current description of `.vindex` spends >> more time talking about what it doesn't do, compared to the other methods, >> than what it does. >> > > Will do. > > >> I'm still leaning towards not warning on current, unproblematic common >> uses. It's unnecessary churn for currently working, understandable code. I >> would still reserve warnings and deprecation for the cases where the >> current behavior gives us something that no one wants. Those are the real >> traps that people need to be warned away from. >> >> If someone is mixing slices and integer indices, that's a really good >> sign that they thought indexing behaved in a different way (e.g. orthogonal >> indexing). >> > > I agree, but I'm still not entirely sure where to draw the line on > behavior that should issue a warning. Some options, in roughly descending > order of severity: > 1. Warn if [] would give a different result than .oindex[]. This is the > current proposal in the NEP, but based on the feedback we should hold back > on it for now. > 2. Warn if there is a mixture of arrays/slice objects in indices for [], > even implicitly (e.g., including arr[idx] when is equivalent to arr[idx, > :]). In this case, indices end up at the end both for legacy_index and > vindex, but arguably that is only a happy coincidence. > I'd have to deep dive through my email archive to double check, but I'm pretty sure this is intentional design, not coincidence. There is a long-standing pattern of using the first axes as the "collection" axes when the objects that we are concerned with are vectors or matrices or more. For example, evaluate a scalar field on a grid in 3D space (nx, ny, nz), then the gradient at those points is usually represented as (nx, ny, nz, 3). It is desirable to be able to apply the same indices to the scalar grid and the vector grid to select out the scalar and vector values at the same set of points. It's why we implicitly tack on empty slices to the end of any partial index tuple (e.g. with just integer scalars). The current rules for mixing slices and integer array indices are possibly the simplest way to effect this use case; it is the behaviors for the other cases that are the unhappy coincidences. 3. Warn if [] would give a different result from .vindex[]. This is a > little weaker than the previous condition, because arr[idx, :] or arr[idx, > ...] would not give a warning. However, cases like arr[..., idx] or arr[:, > idx, :] would still start to give warnings, even though they are arguably > well defined according to either outer indexing (if idx.ndim == 1) or > legacy indexing (due to dimension reordering rules that will be omitted > from vindex). > 4. 
Warn if there are multiple arrays/integer indices separated by a slice > object, e.g., arr[idx1, :, idx2]. This is the edge case that really trips > up users. > > As I said in my other response, in the long term, I would prefer to either > (a) drop support for vectorized indexing in [] or (b) if we stick with > supporting vectorized indexing in [], at least ensure consistent dimension > ordering rules for [] and vindex[]. That would suggest using either my > proposed rule 2 or 3. > > I also agree with you that anyone mixing slices and integers probably is > confused about how indexing works, at least in edge cases. But given the > lengths that legacy indexing goes to to support "outer indexing-like" > behavior in the common case of a single integer array and many slices, I am > hesitant to start warning in this case. The result of arr[..., idx, :] is > relatively easy to understand, even though it uses its own set of rules, > which happen to be more consistent with oindex[] than vindex[]. > > We certainly could make the conservative choice of only adopting 4 for now > and leaving further cleanup for later. I guess this uncertainty about > whether direct indexing should be more like vindex[] or oindex[] in the > long term is a good argument for holding off on other warnings for now. But > I think we are almost certainly going to want to make further > warnings/deprecations of some form. > I'd prefer 4, could be talked into 3, but any higher is not a good idea, I don't think. -- Robert Kern -------------- next part -------------- An HTML attachment was scrubbed... URL: From robert.kern at gmail.com Wed Jun 27 01:26:44 2018 From: robert.kern at gmail.com (Robert Kern) Date: Tue, 26 Jun 2018 22:26:44 -0700 Subject: [Numpy-discussion] NEP 21: Simplified and explicit advanced indexing In-Reply-To: <1530076798.4187189.1421749616.4AC83CBA@webmail.messagingengine.com> References: <1529994253.102107.1420482720.1911BC65@webmail.messagingengine.com> <3aa8387021a6367c7a6227e424226904631bce60.camel@sipsolutions.net> <1530076798.4187189.1421749616.4AC83CBA@webmail.messagingengine.com> Message-ID: On Tue, Jun 26, 2018 at 10:21 PM Juan Nunez-Iglesias wrote: > Let me start by thanking Robert for articulating my viewpoints far better > than I could have done myself. I want to explicitly flag the following > statements for endorsement: > > *I would still reserve warnings and deprecation for the cases where the > current behavior gives us something that no one wants. Those are the real > traps that people need to be warned away from.* > > > *In the post-NEP .oindex/.vindex order, everyone can get the behavior that > they want. Your argument for deprecation is now just about what the default > is, the semantics that get pride of place with the shortest spelling. I am > sympathetic to the feeling like you wish you had a time machine to go fix a > design with your new insight. But it seems to me that just changing which > semantics are the default has relatively attenuated value while breaking > compatibility for a fundamental feature of numpy has significant costs. > Just introducing .oindex is the bulk of the value of this NEP. Everything > else is window dressing.* > > > *If someone is mixing slices and integer indices, that's a really good > sign that they thought indexing behaved in a different way (e.g. 
orthogonal > indexing).* > > > I would offer the exception of trailing slices to this statement, though: > > In [1]: from skimage import data > In [2]: astro = data.astronaut() > In [3]: astro.shape > Out[3]: (512, 512, 3) > > In [4]: rr, cc = np.array([1, 3, 3, 3]), np.array([1, 8, 9, 10]) > In [5]: astro[rr, cc].shape > Out[5]: (4, 3) > > In [6]: astro[rr, cc, :].shape > Out[6]: (4, 3) > > This does exactly what I would expect. > Yup, sorry, I didn't mean those. I meant when there is an explicit slice in between index arrays. (And maybe when index arrays follow slices; I'll need to think more on that.) > Going back to the motivation for the NEP, I think this bit, emphasis mine, > is crucial: > > the existing rules for advanced indexing with multiple array indices are > typically confusing to both new, **and in many cases even old,** users of > NumPy > > > I think it is ok for advanced indexing to be accessible to advanced users. > I remember that it took me quite a while to grok NumPy advanced indexing, > but once I did I just loved it. > > I also like that this syntax translates perfectly from integer indices to > float coordinates in `ndimage.map_coordinates`. > > *I'll go on record as saying that array-likes should respond to `a[rr, > cc]`, as in Juan's example, with the current behavior. And if they don't, > they don't deserve to be operated on by skimage functions.* > > > (I don't think of us highly enough to use the word "deserve", but I would > say that we would hesitate to support arrays that don't use this > convention.) > Ahem, yes, I was being provocative in a moment of weakness. May the array-like authors forgive me. -- Robert Kern -------------- next part -------------- An HTML attachment was scrubbed... URL: From shoyer at gmail.com Wed Jun 27 01:34:30 2018 From: shoyer at gmail.com (Stephan Hoyer) Date: Tue, 26 Jun 2018 22:34:30 -0700 Subject: [Numpy-discussion] NEP 21: Simplified and explicit advanced indexing In-Reply-To: References: <1529994253.102107.1420482720.1911BC65@webmail.messagingengine.com> Message-ID: On Tue, Jun 26, 2018 at 10:22 PM Robert Kern wrote: > We certainly could make the conservative choice of only adopting 4 for now >> and leaving further cleanup for later. I guess this uncertainty about >> whether direct indexing should be more like vindex[] or oindex[] in the >> long term is a good argument for holding off on other warnings for now. But >> I think we are almost certainly going to want to make further >> warnings/deprecations of some form. >> > > I'd prefer 4, could be talked into 3, but any higher is not a good idea, I > don't think. > OK, I think 4 is the safe option for now. Eventually, I want either 1 or 3. But: - We don't agree yet on whether the right long-term solution would be for [] to support vectorized indexing, outer indexing or neither. - This will certainly cause some amount of churn, so let's save it for later when vindex/oindex are widely used and libraries don't need to worry about whether they're available or not they are available in all NumPy versions they support. -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From shoyer at gmail.com Wed Jun 27 01:48:59 2018 From: shoyer at gmail.com (Stephan Hoyer) Date: Wed, 27 Jun 2018 01:48:59 -0400 Subject: [Numpy-discussion] Revised NEP-18, __array_function__ protocol Message-ID: After much discussion (and the addition of three new co-authors!), I?m pleased to present a significantly revision of NumPy Enhancement Proposal 18: A dispatch mechanism for NumPy's high level array functions: http://www.numpy.org/neps/nep-0018-array-function-protocol.html The full text is also included below. Best, Stephan =========================================================== A dispatch mechanism for NumPy's high level array functions =========================================================== :Author: Stephan Hoyer :Author: Matthew Rocklin :Author: Marten van Kerkwijk :Author: Hameer Abbasi :Author: Eric Wieser :Status: Draft :Type: Standards Track :Created: 2018-05-29 Abstact ------- We propose the ``__array_function__`` protocol, to allow arguments of NumPy functions to define how that function operates on them. This will allow using NumPy as a high level API for efficient multi-dimensional array operations, even with array implementations that differ greatly from ``numpy.ndarray``. Detailed description -------------------- NumPy's high level ndarray API has been implemented several times outside of NumPy itself for different architectures, such as for GPU arrays (CuPy), Sparse arrays (scipy.sparse, pydata/sparse) and parallel arrays (Dask array) as well as various NumPy-like implementations in the deep learning frameworks, like TensorFlow and PyTorch. Similarly there are many projects that build on top of the NumPy API for labeled and indexed arrays (XArray), automatic differentiation (Autograd, Tangent), masked arrays (numpy.ma), physical units (astropy.units, pint, unyt), etc. that add additional functionality on top of the NumPy API. Most of these project also implement a close variation of NumPy's level high API. We would like to be able to use these libraries together, for example we would like to be able to place a CuPy array within XArray, or perform automatic differentiation on Dask array code. This would be easier to accomplish if code written for NumPy ndarrays could also be used by other NumPy-like projects. For example, we would like for the following code example to work equally well with any NumPy-like array object: .. code:: python def f(x): y = np.tensordot(x, x.T) return np.mean(np.exp(y)) Some of this is possible today with various protocol mechanisms within NumPy. - The ``np.exp`` function checks the ``__array_ufunc__`` protocol - The ``.T`` method works using Python's method dispatch - The ``np.mean`` function explicitly checks for a ``.mean`` method on the argument However other functions, like ``np.tensordot`` do not dispatch, and instead are likely to coerce to a NumPy array (using the ``__array__``) protocol, or err outright. To achieve enough coverage of the NumPy API to support downstream projects like XArray and autograd we want to support *almost all* functions within NumPy, which calls for a more reaching protocol than just ``__array_ufunc__``. We would like a protocol that allows arguments of a NumPy function to take control and divert execution to another function (for example a GPU or parallel implementation) in a way that is safe and consistent across projects. Implementation -------------- We propose adding support for a new protocol in NumPy, ``__array_function__``. 
This protocol is intended to be a catch-all for NumPy functionality that is not covered by the ``__array_ufunc__`` protocol for universal functions (like ``np.exp``). The semantics are very similar to ``__array_ufunc__``, except the operation is specified by an arbitrary callable object rather than a ufunc instance and method. A prototype implementation can be found in `this notebook < https://nbviewer.jupyter.org/gist/shoyer/1f0a308a06cd96df20879a1ddb8f0006 >`_. The interface ~~~~~~~~~~~~~ We propose the following signature for implementations of ``__array_function__``: .. code-block:: python def __array_function__(self, func, types, args, kwargs) - ``func`` is an arbitrary callable exposed by NumPy's public API, which was called in the form ``func(*args, **kwargs)``. - ``types`` is a ``frozenset`` of unique argument types from the original NumPy function call that implement ``__array_function__``. - The tuple ``args`` and dict ``kwargs`` are directly passed on from the original call. Unlike ``__array_ufunc__``, there are no high-level guarantees about the type of ``func``, or about which of ``args`` and ``kwargs`` may contain objects implementing the array API. As a convenience for ``__array_function__`` implementors, ``types`` provides all argument types with an ``'__array_function__'`` attribute. This allows downstream implementations to quickly determine if they are likely able to support the operation. A ``frozenset`` is used to ensure that ``__array_function__`` implementations cannot rely on the iteration order of ``types``, which would facilitate violating the well-defined "Type casting hierarchy" described in `NEP-13 `_. Example for a project implementing the NumPy API ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Most implementations of ``__array_function__`` will start with two checks: 1. Is the given function something that we know how to overload? 2. Are all arguments of a type that we know how to handle? If these conditions hold, ``__array_function__`` should return the result from calling its implementation for ``func(*args, **kwargs)``. Otherwise, it should return the sentinel value ``NotImplemented``, indicating that the function is not implemented by these types. This is preferable to raising ``TypeError`` directly, because it gives *other* arguments the opportunity to define the operations. There are no general requirements on the return value from ``__array_function__``, although most sensible implementations should probably return array(s) with the same type as one of the function's arguments. If/when Python gains `typing support for protocols `_ and NumPy adds static type annotations, the ``@overload`` implementation for ``SupportsArrayFunction`` will indicate a return type of ``Any``. It may also be convenient to define a custom decorators (``implements`` below) for registering ``__array_function__`` implementations. .. code:: python HANDLED_FUNCTIONS = {} class MyArray: def __array_function__(self, func, types, args, kwargs): if func not in HANDLED_FUNCTIONS: return NotImplemented # Note: this allows subclasses that don't override # __array_function__ to handle MyArray objects if not all(issubclass(t, MyArray) for t in types): return NotImplemented return HANDLED_FUNCTIONS[func](*args, **kwargs) def implements(numpy_function): """Register an __array_function__ implementation for MyArray objects.""" def decorator(func): HANDLED_FUNCTIONS[numpy_function] = func return func return decorator @implements(np.concatenate) def concatenate(arrays, axis=0, out=None): ... 
# implementation of concatenate for MyArray objects @implements(np.broadcast_to) def broadcast_to(array, shape): ... # implementation of broadcast_to for MyArray objects Note that it is not required for ``__array_function__`` implementations to include *all* of the corresponding NumPy function's optional arguments (e.g., ``broadcast_to`` above omits the irrelevant ``subok`` argument). Optional arguments are only passed in to ``__array_function__`` if they were explicitly used in the NumPy function call. Necessary changes within the NumPy codebase itself ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ This will require two changes within the NumPy codebase: 1. A function to inspect available inputs, look for the ``__array_function__`` attribute on those inputs, and call those methods appropriately until one succeeds. This needs to be fast in the common all-NumPy case, and have acceptable performance (no worse than linear time) even if the number of overloaded inputs is large (e.g., as might be the case for `np.concatenate`). This is one additional function of moderate complexity. 2. Calling this function within all relevant NumPy functions. This affects many parts of the NumPy codebase, although with very low complexity. Finding and calling the right ``__array_function__`` ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Given a NumPy function, ``*args`` and ``**kwargs`` inputs, we need to search through ``*args`` and ``**kwargs`` for all appropriate inputs that might have the ``__array_function__`` attribute. Then we need to select among those possible methods and execute the right one. Negotiating between several possible implementations can be complex. Finding arguments ''''''''''''''''' Valid arguments may be directly in the ``*args`` and ``**kwargs``, such as in the case for ``np.tensordot(left, right, out=out)``, or they may be nested within lists or dictionaries, such as in the case of ``np.concatenate([x, y, z])``. This can be problematic for two reasons: 1. Some functions are given long lists of values, and traversing them might be prohibitively expensive. 2. Some functions may have arguments that we don't want to inspect, even if they have the ``__array_function__`` method. To resolve these issues, NumPy functions should explicitly indicate which of their arguments may be overloaded, and how these arguments should be checked. As a rule, this should include all arguments documented as either ``array_like`` or ``ndarray``. We propose to do so by writing "dispatcher" functions for each overloaded NumPy function: - These functions will be called with the exact same arguments that were passed into the NumPy function (i.e., ``dispatcher(*args, **kwargs)``), and should return an iterable of arguments to check for overrides. - Dispatcher functions are required to share the exact same positional, optional and keyword-only arguments as their corresponding NumPy functions. Otherwise, valid invocations of a NumPy function could result in an error when calling its dispatcher. - Because default *values* for keyword arguments do not have ``__array_function__`` attributes, by convention we set all default argument values to ``None``. This reduces the likelihood of signatures falling out of sync, and minimizes extraneous information in the dispatcher. The only exception should be cases where the argument value in some way effects dispatching, which should be rare. An example of the dispatcher for ``np.concatenate`` may be instructive: .. 
code:: python def _concatenate_dispatcher(arrays, axis=None, out=None): for array in arrays: yield array if out is not None: yield out The concatenate dispatcher is written as generator function, which allows it to potentially include the value of the optional ``out`` argument without needing to create a new sequence with the (potentially long) list of objects to be concatenated. Trying ``__array_function__`` methods until the right one works ''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''' Many arguments may implement the ``__array_function__`` protocol. Some of these may decide that, given the available inputs, they are unable to determine the correct result. How do we call the right one? If several are valid then which has precedence? For the most part, the rules for dispatch with ``__array_function__`` match those for ``__array_ufunc__`` (see `NEP-13 `_). In particular: - NumPy will gather implementations of ``__array_function__`` from all specified inputs and call them in order: subclasses before superclasses, and otherwise left to right. Note that in some edge cases involving subclasses, this differs slightly from the `current behavior `_ of Python. - Implementations of ``__array_function__`` indicate that they can handle the operation by returning any value other than ``NotImplemented``. - If all ``__array_function__`` methods return ``NotImplemented``, NumPy will raise ``TypeError``. One deviation from the current behavior of ``__array_ufunc__`` is that NumPy will only call ``__array_function__`` on the *first* argument of each unique type. This matches Python's `rule for calling reflected methods < https://docs.python.org/3/reference/datamodel.html#object.__ror__>`_, and this ensures that checking overloads has acceptable performance even when there are a large number of overloaded arguments. To avoid long-term divergence between these two dispatch protocols, we should `also update `_ ``__array_ufunc__`` to match this behavior. Special handling of ``numpy.ndarray`` ''''''''''''''''''''''''''''''''''''' The use cases for subclasses with ``__array_function__`` are the same as those with ``__array_ufunc__``, so ``numpy.ndarray`` should also define a ``__array_function__`` method mirroring ``ndarray.__array_ufunc__``: .. code:: python def __array_function__(self, func, types, args, kwargs): # Cannot handle items that have __array_function__ other than our own. for t in types: if (hasattr(t, '__array_function__') and t.__array_function__ is not ndarray.__array_function__): return NotImplemented # Arguments contain no overrides, so we can safely call the # overloaded function again. return func(*args, **kwargs) To avoid infinite recursion, the dispatch rules for ``__array_function__`` need also the same special case they have for ``__array_ufunc__``: any arguments with an ``__array_function__`` method that is identical to ``numpy.ndarray.__array_function__`` are not be called as ``__array_function__`` implementations. Changes within NumPy functions ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Given a function defining the above behavior, for now call it ``try_array_function_override``, we now need to call that function from within every relevant NumPy function. This is a pervasive change, but of fairly simple and innocuous code that should complete quickly and without effect if no arguments implement the ``__array_function__`` protocol. In most cases, these functions should written using the ``array_function_dispatch`` decorator, which also associates dispatcher functions: .. 
code:: python def array_function_dispatch(dispatcher): """Wrap a function for dispatch with the __array_function__ protocol.""" def decorator(func): @functools.wraps(func) def new_func(*args, **kwargs): relevant_arguments = dispatcher(*args, **kwargs) success, value = try_array_function_override( new_func, relevant_arguments, args, kwargs) if success: return value return func(*args, **kwargs) return new_func return decorator # example usage def _broadcast_to_dispatcher(array, shape, subok=None, **ignored_kwargs): return (array,) @array_function_dispatch(_broadcast_to_dispatcher) def broadcast_to(array, shape, subok=False): ... # existing definition of np.broadcast_to Using a decorator is great! We don't need to change the definitions of existing NumPy functions, and only need to write a few additional lines for the dispatcher function. We could even reuse a single dispatcher for families of functions with the same signature (e.g., ``sum`` and ``prod``). For such functions, the largest change could be adding a few lines to the docstring to note which arguments are checked for overloads. It's particularly worth calling out the decorator's use of ``functools.wraps``: - This ensures that the wrapped function has the same name and docstring as the wrapped NumPy function. - On Python 3, it also ensures that the decorator function copies the original function signature, which is important for introspection based tools such as auto-complete. If we care about preserving function signatures on Python 2, for the `short while longer < http://www.numpy.org/neps/nep-0014-dropping-python2.7-proposal.html>`_ that NumPy supports Python 2.7, we do could do so by adding a vendored dependency on the (single-file, BSD licensed) `decorator library `_. - Finally, it ensures that the wrapped function `can be pickled < http://gael-varoquaux.info/programming/decoration-in-python-done-right-decorating-and-pickling.html >`_. In a few cases, it would not make sense to use the ``array_function_dispatch`` decorator directly, but override implementation in terms of ``try_array_function_override`` should still be straightforward. - Functions written entirely in C (e.g., ``np.concatenate``) can't use decorators, but they could still use a C equivalent of ``try_array_function_override``. If performance is not a concern, they could also be easily wrapped with a small Python wrapper. - The ``__call__`` method of ``np.vectorize`` can't be decorated with

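For readers following the dispatch rules above, a rough, illustrative sketch of what the ``try_array_function_override`` helper used by the decorator could look like; this is not NumPy's actual implementation and it omits the ``numpy.ndarray`` special case described earlier:

def try_array_function_override(func, relevant_arguments, args, kwargs):
    # Collect one representative argument per type that defines
    # __array_function__, keeping subclasses ahead of superclasses and
    # otherwise preserving left-to-right order.
    overloaded_args = []
    for arg in relevant_arguments:
        arg_type = type(arg)
        if (hasattr(arg_type, '__array_function__') and
                not any(arg_type is type(other) for other in overloaded_args)):
            index = len(overloaded_args)
            for i, other in enumerate(overloaded_args):
                if issubclass(arg_type, type(other)):
                    index = i
                    break
            overloaded_args.insert(index, arg)

    if not overloaded_args:
        # No overrides: the caller falls back to the default implementation.
        return False, None

    types = frozenset(type(arg) for arg in overloaded_args)
    for arg in overloaded_args:
        result = arg.__array_function__(func, types, args, kwargs)
        if result is not NotImplemented:
            return True, result

    raise TypeError('no implementation found for {!r} on types {}'
                    .format(func, list(types)))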
`_. Example for a project implementing the NumPy API ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Most implementations of ``__array_function__`` will start with two checks: 1. Is the given function something that we know how to overload? 2. Are all arguments of a type that we know how to handle? If these conditions hold, ``__array_function__`` should return the result from calling its implementation for ``func(*args, **kwargs)``. Otherwise, it should return the sentinel value ``NotImplemented``, indicating that the function is not implemented by these types. This is preferable to raising ``TypeError`` directly, because it gives *other* arguments the opportunity to define the operations. There are no general requirements on the return value from ``__array_function__``, although most sensible implementations should probably return array(s) with the same type as one of the function's arguments. If/when Python gains `typing support for protocols `_ and NumPy adds static type annotations, the ``@overload`` implementation for ``SupportsArrayFunction`` will indicate a return type of ``Any``. It may also be convenient to define a custom decorators (``implements`` below) for registering ``__array_function__`` implementations. .. code:: python HANDLED_FUNCTIONS = {} class MyArray: def __array_function__(self, func, types, args, kwargs): if func not in HANDLED_FUNCTIONS: return NotImplemented # Note: this allows subclasses that don't override # __array_function__ to handle MyArray objects if not all(issubclass(t, MyArray) for t in types): return NotImplemented return HANDLED_FUNCTIONS[func](*args, **kwargs) def implements(numpy_function): """Register an __array_function__ implementation for MyArray objects.""" def decorator(func): HANDLED_FUNCTIONS[numpy_function] = func return func return decorator @implements(np.concatenate) def concatenate(arrays, axis=0, out=None): ... # implementation of concatenate for MyArray objects @implements(np.broadcast_to) def broadcast_to(array, shape): ... # implementation of broadcast_to for MyArray objects Note that it is not required for ``__array_function__`` implementations to include *all* of the corresponding NumPy function's optional arguments (e.g., ``broadcast_to`` above omits the irrelevant ``subok`` argument). Optional arguments are only passed in to ``__array_function__`` if they were explicitly used in the NumPy function call. Necessary changes within the NumPy codebase itself ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ This will require two changes within the NumPy codebase: 1. A function to inspect available inputs, look for the ``__array_function__`` attribute on those inputs, and call those methods appropriately until one succeeds. This needs to be fast in the common all-NumPy case, and have acceptable performance (no worse than linear time) even if the number of overloaded inputs is large (e.g., as might be the case for `np.concatenate`). This is one additional function of moderate complexity. 2. Calling this function within all relevant NumPy functions. This affects many parts of the NumPy codebase, although with very low complexity. Finding and calling the right ``__array_function__`` ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Given a NumPy function, ``*args`` and ``**kwargs`` inputs, we need to search through ``*args`` and ``**kwargs`` for all appropriate inputs that might have the ``__array_function__`` attribute. Then we need to select among those possible methods and execute the right one. 
Negotiating between several possible implementations can be complex. Finding arguments ''''''''''''''''' Valid arguments may be directly in the ``*args`` and ``**kwargs``, such as in the case for ``np.tensordot(left, right, out=out)``, or they may be nested within lists or dictionaries, such as in the case of ``np.concatenate([x, y, z])``. This can be problematic for two reasons: 1. Some functions are given long lists of values, and traversing them might be prohibitively expensive. 2. Some functions may have arguments that we don't want to inspect, even if they have the ``__array_function__`` method. To resolve these issues, NumPy functions should explicitly indicate which of their arguments may be overloaded, and how these arguments should be checked. As a rule, this should include all arguments documented as either ``array_like`` or ``ndarray``. We propose to do so by writing "dispatcher" functions for each overloaded NumPy function: - These functions will be called with the exact same arguments that were passed into the NumPy function (i.e., ``dispatcher(*args, **kwargs)``), and should return an iterable of arguments to check for overrides. - Dispatcher functions are required to share the exact same positional, optional and keyword-only arguments as their corresponding NumPy functions. Otherwise, valid invocations of a NumPy function could result in an error when calling its dispatcher. - Because default *values* for keyword arguments do not have ``__array_function__`` attributes, by convention we set all default argument values to ``None``. This reduces the likelihood of signatures falling out of sync, and minimizes extraneous information in the dispatcher. The only exception should be cases where the argument value in some way effects dispatching, which should be rare. An example of the dispatcher for ``np.concatenate`` may be instructive: .. code:: python def _concatenate_dispatcher(arrays, axis=None, out=None): for array in arrays: yield array if out is not None: yield out The concatenate dispatcher is written as generator function, which allows it to potentially include the value of the optional ``out`` argument without needing to create a new sequence with the (potentially long) list of objects to be concatenated. Trying ``__array_function__`` methods until the right one works ''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''' Many arguments may implement the ``__array_function__`` protocol. Some of these may decide that, given the available inputs, they are unable to determine the correct result. How do we call the right one? If several are valid then which has precedence? For the most part, the rules for dispatch with ``__array_function__`` match those for ``__array_ufunc__`` (see `NEP-13 `_). In particular: - NumPy will gather implementations of ``__array_function__`` from all specified inputs and call them in order: subclasses before superclasses, and otherwise left to right. Note that in some edge cases involving subclasses, this differs slightly from the `current behavior `_ of Python. - Implementations of ``__array_function__`` indicate that they can handle the operation by returning any value other than ``NotImplemented``. - If all ``__array_function__`` methods return ``NotImplemented``, NumPy will raise ``TypeError``. One deviation from the current behavior of ``__array_ufunc__`` is that NumPy will only call ``__array_function__`` on the *first* argument of each unique type. 
This matches Python's `rule for calling reflected methods < https://docs.python.org/3/reference/datamodel.html#object.__ror__>`_, and this ensures that checking overloads has acceptable performance even when there are a large number of overloaded arguments. To avoid long-term divergence between these two dispatch protocols, we should `also update `_ ``__array_ufunc__`` to match this behavior. Special handling of ``numpy.ndarray`` ''''''''''''''''''''''''''''''''''''' The use cases for subclasses with ``__array_function__`` are the same as those with ``__array_ufunc__``, so ``numpy.ndarray`` should also define a ``__array_function__`` method mirroring ``ndarray.__array_ufunc__``: .. code:: python def __array_function__(self, func, types, args, kwargs): # Cannot handle items that have __array_function__ other than our own. for t in types: if (hasattr(t, '__array_function__') and t.__array_function__ is not ndarray.__array_function__): return NotImplemented # Arguments contain no overrides, so we can safely call the # overloaded function again. return func(*args, **kwargs) To avoid infinite recursion, the dispatch rules for ``__array_function__`` need also the same special case they have for ``__array_ufunc__``: any arguments with an ``__array_function__`` method that is identical to ``numpy.ndarray.__array_function__`` are not be called as ``__array_function__`` implementations. Changes within NumPy functions ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Given a function defining the above behavior, for now call it ``try_array_function_override``, we now need to call that function from within every relevant NumPy function. This is a pervasive change, but of fairly simple and innocuous code that should complete quickly and without effect if no arguments implement the ``__array_function__`` protocol. In most cases, these functions should written using the ``array_function_dispatch`` decorator, which also associates dispatcher functions: .. code:: python def array_function_dispatch(dispatcher): """Wrap a function for dispatch with the __array_function__ protocol.""" def decorator(func): @functools.wraps(func) def new_func(*args, **kwargs): relevant_arguments = dispatcher(*args, **kwargs) success, value = try_array_function_override( new_func, relevant_arguments, args, kwargs) if success: return value return func(*args, **kwargs) return new_func return decorator # example usage def _broadcast_to_dispatcher(array, shape, subok=None, **ignored_kwargs): return (array,) @array_function_dispatch(_broadcast_to_dispatcher) def broadcast_to(array, shape, subok=False): ... # existing definition of np.broadcast_to Using a decorator is great! We don't need to change the definitions of existing NumPy functions, and only need to write a few additional lines for the dispatcher function. We could even reuse a single dispatcher for families of functions with the same signature (e.g., ``sum`` and ``prod``). For such functions, the largest change could be adding a few lines to the docstring to note which arguments are checked for overloads. It's particularly worth calling out the decorator's use of ``functools.wraps``: - This ensures that the wrapped function has the same name and docstring as the wrapped NumPy function. - On Python 3, it also ensures that the decorator function copies the original function signature, which is important for introspection based tools such as auto-complete. 
It's particularly worth calling out the decorator's use of
``functools.wraps``:

- This ensures that the wrapped function has the same name and docstring
  as the wrapped NumPy function.
- On Python 3, it also ensures that the decorator function copies the
  original function signature, which is important for introspection-based
  tools such as auto-complete. If we care about preserving function
  signatures on Python 2, for the `short while longer
  <http://www.numpy.org/neps/nep-0014-dropping-python2.7-proposal.html>`_
  that NumPy supports Python 2.7, we could do so by adding a vendored
  dependency on the (single-file, BSD licensed)
  `decorator library `_.
- Finally, it ensures that the wrapped function `can be pickled
  <http://gael-varoquaux.info/programming/decoration-in-python-done-right-decorating-and-pickling.html>`_.

In a few cases, it would not make sense to use the
``array_function_dispatch`` decorator directly, but writing the override
in terms of ``try_array_function_override`` should still be
straightforward.

- Functions written entirely in C (e.g., ``np.concatenate``) can't use
  decorators, but they could still use a C equivalent of
  ``try_array_function_override``. If performance is not a concern, they
  could also be easily wrapped with a small Python wrapper.
- The ``__call__`` method of ``np.vectorize`` can't be decorated with

> would say that we would hesitate to support arrays that don't use > > this convention.) > > > > Ahem, yes, I was being provocative in a moment of weakness. May the > array-like authors forgive me. > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at python.org > https://mail.python.org/mailman/listinfo/numpy-discussion -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 833 bytes Desc: This is a digitally signed message part URL: From m.h.vankerkwijk at gmail.com Wed Jun 27 11:41:39 2018 From: m.h.vankerkwijk at gmail.com (Marten van Kerkwijk) Date: Wed, 27 Jun 2018 11:41:39 -0400 Subject: [Numpy-discussion] Revised NEP-18, __array_function__ protocol In-Reply-To: References: Message-ID: Hi Hameer, I'm confused: Isn't your reference array just `self`? All the best, Marten On Wed, Jun 27, 2018 at 2:27 AM, Hameer Abbasi wrote: > > > On 27. Jun 2018 at 07:48, Stephan Hoyer wrote: > > > After much discussion (and the addition of three new co-authors!), I?m > pleased to present a significantly revision of NumPy Enhancement Proposal > 18: A dispatch mechanism for NumPy's high level array functions: > http://www.numpy.org/neps/nep-0018-array-function-protocol.html > > The full text is also included below. > > Best, > Stephan > > =========================================================== > A dispatch mechanism for NumPy's high level array functions > =========================================================== > > :Author: Stephan Hoyer > :Author: Matthew Rocklin > :Author: Marten van Kerkwijk > :Author: Hameer Abbasi > :Author: Eric Wieser > :Status: Draft > :Type: Standards Track > :Created: 2018-05-29 > > Abstact > ------- > > We propose the ``__array_function__`` protocol, to allow arguments of NumPy > functions to define how that function operates on them. This will allow > using NumPy as a high level API for efficient multi-dimensional array > operations, even with array implementations that differ greatly from > ``numpy.ndarray``. > > Detailed description > -------------------- > > NumPy's high level ndarray API has been implemented several times > outside of NumPy itself for different architectures, such as for GPU > arrays (CuPy), Sparse arrays (scipy.sparse, pydata/sparse) and parallel > arrays (Dask array) as well as various NumPy-like implementations in the > deep learning frameworks, like TensorFlow and PyTorch. > > Similarly there are many projects that build on top of the NumPy API > for labeled and indexed arrays (XArray), automatic differentiation > (Autograd, Tangent), masked arrays (numpy.ma), physical units > (astropy.units, > pint, unyt), etc. that add additional functionality on top of the NumPy > API. > Most of these project also implement a close variation of NumPy's level > high > API. > > We would like to be able to use these libraries together, for example we > would like to be able to place a CuPy array within XArray, or perform > automatic differentiation on Dask array code. This would be easier to > accomplish if code written for NumPy ndarrays could also be used by > other NumPy-like projects. > > For example, we would like for the following code example to work > equally well with any NumPy-like array object: > > .. code:: python > > def f(x): > y = np.tensordot(x, x.T) > return np.mean(np.exp(y)) > > Some of this is possible today with various protocol mechanisms within > NumPy. 
> > - The ``np.exp`` function checks the ``__array_ufunc__`` protocol > - The ``.T`` method works using Python's method dispatch > - The ``np.mean`` function explicitly checks for a ``.mean`` method on > the argument > > However other functions, like ``np.tensordot`` do not dispatch, and > instead are likely to coerce to a NumPy array (using the ``__array__``) > protocol, or err outright. To achieve enough coverage of the NumPy API > to support downstream projects like XArray and autograd we want to > support *almost all* functions within NumPy, which calls for a more > reaching protocol than just ``__array_ufunc__``. We would like a > protocol that allows arguments of a NumPy function to take control and > divert execution to another function (for example a GPU or parallel > implementation) in a way that is safe and consistent across projects. > > Implementation > -------------- > > We propose adding support for a new protocol in NumPy, > ``__array_function__``. > > This protocol is intended to be a catch-all for NumPy functionality that > is not covered by the ``__array_ufunc__`` protocol for universal functions > (like ``np.exp``). The semantics are very similar to ``__array_ufunc__``, > except > the operation is specified by an arbitrary callable object rather than a > ufunc > instance and method. > > A prototype implementation can be found in > `this notebook 1f0a308a06cd96df20879a1ddb8f0006>`_. > > The interface > ~~~~~~~~~~~~~ > > We propose the following signature for implementations of > ``__array_function__``: > > .. code-block:: python > > def __array_function__(self, func, types, args, kwargs) > > - ``func`` is an arbitrary callable exposed by NumPy's public API, > which was called in the form ``func(*args, **kwargs)``. > - ``types`` is a ``frozenset`` of unique argument types from the original > NumPy > function call that implement ``__array_function__``. > - The tuple ``args`` and dict ``kwargs`` are directly passed on from the > original call. > > Unlike ``__array_ufunc__``, there are no high-level guarantees about the > type of ``func``, or about which of ``args`` and ``kwargs`` may contain > objects > implementing the array API. > > As a convenience for ``__array_function__`` implementors, ``types`` > provides all > argument types with an ``'__array_function__'`` attribute. This > allows downstream implementations to quickly determine if they are likely > able > to support the operation. A ``frozenset`` is used to ensure that > ``__array_function__`` implementations cannot rely on the iteration order > of > ``types``, which would facilitate violating the well-defined "Type casting > hierarchy" described in > `NEP-13 `_. > > Example for a project implementing the NumPy API > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ > > Most implementations of ``__array_function__`` will start with two > checks: > > 1. Is the given function something that we know how to overload? > 2. Are all arguments of a type that we know how to handle? > > If these conditions hold, ``__array_function__`` should return > the result from calling its implementation for ``func(*args, **kwargs)``. > Otherwise, it should return the sentinel value ``NotImplemented``, > indicating > that the function is not implemented by these types. This is preferable to > raising ``TypeError`` directly, because it gives *other* arguments the > opportunity to define the operations. 
> > There are no general requirements on the return value from > ``__array_function__``, although most sensible implementations should > probably > return array(s) with the same type as one of the function's arguments. > If/when Python gains > `typing support for protocols >`_ > and NumPy adds static type annotations, the ``@overload`` implementation > for ``SupportsArrayFunction`` will indicate a return type of ``Any``. > > It may also be convenient to define a custom decorators (``implements`` > below) > for registering ``__array_function__`` implementations. > > .. code:: python > > HANDLED_FUNCTIONS = {} > > class MyArray: > def __array_function__(self, func, types, args, kwargs): > if func not in HANDLED_FUNCTIONS: > return NotImplemented > # Note: this allows subclasses that don't override > # __array_function__ to handle MyArray objects > if not all(issubclass(t, MyArray) for t in types): > return NotImplemented > return HANDLED_FUNCTIONS[func](*args, **kwargs) > > def implements(numpy_function): > """Register an __array_function__ implementation for MyArray > objects.""" > def decorator(func): > HANDLED_FUNCTIONS[numpy_function] = func > return func > return decorator > > @implements(np.concatenate) > def concatenate(arrays, axis=0, out=None): > ... # implementation of concatenate for MyArray objects > > @implements(np.broadcast_to) > def broadcast_to(array, shape): > ... # implementation of broadcast_to for MyArray objects > > Note that it is not required for ``__array_function__`` implementations to > include *all* of the corresponding NumPy function's optional arguments > (e.g., ``broadcast_to`` above omits the irrelevant ``subok`` argument). > Optional arguments are only passed in to ``__array_function__`` if they > were explicitly used in the NumPy function call. > > Necessary changes within the NumPy codebase itself > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ > > This will require two changes within the NumPy codebase: > > 1. A function to inspect available inputs, look for the > ``__array_function__`` attribute on those inputs, and call those > methods appropriately until one succeeds. This needs to be fast in the > common all-NumPy case, and have acceptable performance (no worse than > linear time) even if the number of overloaded inputs is large (e.g., > as might be the case for `np.concatenate`). > > This is one additional function of moderate complexity. > 2. Calling this function within all relevant NumPy functions. > > This affects many parts of the NumPy codebase, although with very low > complexity. > > Finding and calling the right ``__array_function__`` > ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ > > Given a NumPy function, ``*args`` and ``**kwargs`` inputs, we need to > search through ``*args`` and ``**kwargs`` for all appropriate inputs > that might have the ``__array_function__`` attribute. Then we need to > select among those possible methods and execute the right one. > Negotiating between several possible implementations can be complex. > > Finding arguments > ''''''''''''''''' > > Valid arguments may be directly in the ``*args`` and ``**kwargs``, such > as in the case for ``np.tensordot(left, right, out=out)``, or they may > be nested within lists or dictionaries, such as in the case of > ``np.concatenate([x, y, z])``. This can be problematic for two reasons: > > 1. Some functions are given long lists of values, and traversing them > might be prohibitively expensive. > 2. 
Some functions may have arguments that we don't want to inspect, even > if they have the ``__array_function__`` method. > > To resolve these issues, NumPy functions should explicitly indicate which > of their arguments may be overloaded, and how these arguments should be > checked. As a rule, this should include all arguments documented as either > ``array_like`` or ``ndarray``. > > We propose to do so by writing "dispatcher" functions for each overloaded > NumPy function: > > - These functions will be called with the exact same arguments that were > passed > into the NumPy function (i.e., ``dispatcher(*args, **kwargs)``), and > should > return an iterable of arguments to check for overrides. > - Dispatcher functions are required to share the exact same positional, > optional and keyword-only arguments as their corresponding NumPy > functions. > Otherwise, valid invocations of a NumPy function could result in an > error when > calling its dispatcher. > - Because default *values* for keyword arguments do not have > ``__array_function__`` attributes, by convention we set all default > argument > values to ``None``. This reduces the likelihood of signatures falling out > of sync, and minimizes extraneous information in the dispatcher. > The only exception should be cases where the argument value in some way > effects dispatching, which should be rare. > > An example of the dispatcher for ``np.concatenate`` may be instructive: > > .. code:: python > > def _concatenate_dispatcher(arrays, axis=None, out=None): > for array in arrays: > yield array > if out is not None: > yield out > > The concatenate dispatcher is written as generator function, which allows > it > to potentially include the value of the optional ``out`` argument without > needing to create a new sequence with the (potentially long) list of > objects > to be concatenated. > > Trying ``__array_function__`` methods until the right one works > ''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''' > > Many arguments may implement the ``__array_function__`` protocol. Some > of these may decide that, given the available inputs, they are unable to > determine the correct result. How do we call the right one? If several > are valid then which has precedence? > > For the most part, the rules for dispatch with ``__array_function__`` > match those for ``__array_ufunc__`` (see > `NEP-13 `_). > In particular: > > - NumPy will gather implementations of ``__array_function__`` from all > specified inputs and call them in order: subclasses before > superclasses, and otherwise left to right. Note that in some edge cases > involving subclasses, this differs slightly from the > `current behavior `_ of Python. > - Implementations of ``__array_function__`` indicate that they can > handle the operation by returning any value other than > ``NotImplemented``. > - If all ``__array_function__`` methods return ``NotImplemented``, > NumPy will raise ``TypeError``. > > One deviation from the current behavior of ``__array_ufunc__`` is that > NumPy > will only call ``__array_function__`` on the *first* argument of each > unique > type. This matches Python's > `rule for calling reflected methods reference/datamodel.html#object.__ror__>`_, > and this ensures that checking overloads has acceptable performance even > when > there are a large number of overloaded arguments. To avoid long-term > divergence > between these two dispatch protocols, we should > `also update `_ > ``__array_ufunc__`` to match this behavior. 
> > Special handling of ``numpy.ndarray`` > ''''''''''''''''''''''''''''''''''''' > > The use cases for subclasses with ``__array_function__`` are the same as > those > with ``__array_ufunc__``, so ``numpy.ndarray`` should also define a > ``__array_function__`` method mirroring ``ndarray.__array_ufunc__``: > > .. code:: python > > def __array_function__(self, func, types, args, kwargs): > # Cannot handle items that have __array_function__ other than our > own. > for t in types: > if (hasattr(t, '__array_function__') and > t.__array_function__ is not > ndarray.__array_function__): > return NotImplemented > > # Arguments contain no overrides, so we can safely call the > # overloaded function again. > return func(*args, **kwargs) > > To avoid infinite recursion, the dispatch rules for ``__array_function__`` > need > also the same special case they have for ``__array_ufunc__``: any > arguments with > an ``__array_function__`` method that is identical to > ``numpy.ndarray.__array_function__`` are not be called as > ``__array_function__`` implementations. > > Changes within NumPy functions > ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ > > Given a function defining the above behavior, for now call it > ``try_array_function_override``, we now need to call that function from > within every relevant NumPy function. This is a pervasive change, but of > fairly simple and innocuous code that should complete quickly and > without effect if no arguments implement the ``__array_function__`` > protocol. > > In most cases, these functions should written using the > ``array_function_dispatch`` decorator, which also associates dispatcher > functions: > > .. code:: python > > def array_function_dispatch(dispatcher): > """Wrap a function for dispatch with the __array_function__ > protocol.""" > def decorator(func): > @functools.wraps(func) > def new_func(*args, **kwargs): > relevant_arguments = dispatcher(*args, **kwargs) > success, value = try_array_function_override( > new_func, relevant_arguments, args, kwargs) > if success: > return value > return func(*args, **kwargs) > return new_func > return decorator > > # example usage > def _broadcast_to_dispatcher(array, shape, subok=None, > **ignored_kwargs): > return (array,) > > @array_function_dispatch(_broadcast_to_dispatcher) > def broadcast_to(array, shape, subok=False): > ... # existing definition of np.broadcast_to > > Using a decorator is great! We don't need to change the definitions of > existing NumPy functions, and only need to write a few additional lines > for the dispatcher function. We could even reuse a single dispatcher for > families of functions with the same signature (e.g., ``sum`` and ``prod``). > For such functions, the largest change could be adding a few lines to the > docstring to note which arguments are checked for overloads. > > It's particularly worth calling out the decorator's use of > ``functools.wraps``: > > - This ensures that the wrapped function has the same name and docstring as > the wrapped NumPy function. > - On Python 3, it also ensures that the decorator function copies the > original > function signature, which is important for introspection based tools > such as > auto-complete. If we care about preserving function signatures on Python > 2, > for the `short while longer nep-0014-dropping-python2.7-proposal.html>`_ > that NumPy supports Python 2.7, we do could do so by adding a vendored > dependency on the (single-file, BSD licensed) > `decorator library `_. 
> - Finally, it ensures that the wrapped function > `can be pickled python-done-right-decorating-and-pickling.html>`_. > > In a few cases, it would not make sense to use the > ``array_function_dispatch`` > decorator directly, but override implementation in terms of > ``try_array_function_override`` should still be straightforward. > > - Functions written entirely in C (e.g., ``np.concatenate``) can't use > decorators, but they could still use a C equivalent of > ``try_array_function_override``. If performance is not a concern, they > could > also be easily wrapped with a small Python wrapper. > - The ``__call__`` method of ``np.vectorize`` can't be decorated with >

From m.h.vankerkwijk at gmail.com Thu Jun 28 08:37:41 2018 From: m.h.vankerkwijk at gmail.com (Marten van Kerkwijk) Date: Thu, 28 Jun 2018 08:37:41 -0400 Subject: [Numpy-discussion] Revised NEP-18, __array_function__ protocol In-Reply-To: References: Message-ID: On Wed, Jun 27, 2018 at 3:50 PM, Stephan Hoyer wrote: > So perhaps it's worth "future proofing" the interface by passing `obj` and > `method` to __array_function__ rather than only `func`. It is slower to > call a func via func.__call__ than func, but only very marginally (~100 ns > in my tests). > That would make it more similar yet to `__array_ufunc__`, but I'm not sure how useful it is, as you cannot generically assume the methods have the same arguments and hence they need their own dispatcher. Once you're there you might as well pass them on directly (since any callable can be used as the function). Indeed, for `__array_ufunc__`, this might not have been a bad idea either... -- Marten -------------- next part -------------- An HTML attachment was scrubbed... URL: From einstein.edison at gmail.com Thu Jun 28 08:46:02 2018 From: einstein.edison at gmail.com (Hameer Abbasi) Date: Thu, 28 Jun 2018 08:46:02 -0400 Subject: [Numpy-discussion] Revised NEP-18, __array_function__ protocol In-Reply-To: References: Message-ID: I think the usefulness of this feature is actually needed. Consider `np.random.RandomState`. If we were to add what I proposed, the two could work very nicely to (for example) do things like creating Dask random arrays, from RandomState objects. For reproducibility, Dask could generate multiple RandomState objects with a seed sequential in the job numbers. Looping in Matt Rocklin for this ? He might have some input about the design. Best Regards, Hameer Abbasi Sent from Astro for Mac On 28. Jun 2018 at 14:37, Marten van Kerkwijk wrote: On Wed, Jun 27, 2018 at 3:50 PM, Stephan Hoyer wrote: > So perhaps it's worth "future proofing" the interface by passing `obj` and > `method` to __array_function__ rather than only `func`. It is slower to > call a func via func.__call__ than func, but only very marginally (~100 ns > in my tests). > That would make it more similar yet to `__array_ufunc__`, but I'm not sure how useful it is, as you cannot generically assume the methods have the same arguments and hence they need their own dispatcher. Once you're there you might as well pass them on directly (since any callable can be used as the function). Indeed, for `__array_ufunc__`, this might not have been a bad idea either... -- Marten _______________________________________________ NumPy-Discussion mailing list NumPy-Discussion at python.org https://mail.python.org/mailman/listinfo/numpy-discussion -------------- next part -------------- An HTML attachment was scrubbed... URL: From shoyer at gmail.com Thu Jun 28 14:04:19 2018 From: shoyer at gmail.com (Stephan Hoyer) Date: Thu, 28 Jun 2018 11:04:19 -0700 Subject: [Numpy-discussion] Revised NEP-18, __array_function__ protocol In-Reply-To: References: Message-ID: On Wed, Jun 27, 2018 at 12:50 PM Stephan Hoyer wrote: > One concern this does raise is how to handle methods like those on > RandomState, even though methods like random_like() don't currently exist. > Distribution objects from scipy.stats could have similar use cases. > > So perhaps it's worth "future proofing" the interface by passing `obj` and > `method` to __array_function__ rather than only `func`. It is slower to > call a func via func.__call__ than func, but only very marginally (~100 ns > in my tests). 
> I did a little more digging, and turned up the __self__ and __func__ attributes of bound methods: https://stackoverflow.com/questions/4679592/how-to-find-instance-of-a-bound-method-in-python So we might need another decorator function, but it seems that the current interface would actually suffice just fine for overriding methods. I'll update the NEP with some examples. It will look something like: def __array_function__(self, func, types, args, kwargs): ... if isinstance(func, types.MethodType): object = func.__self__ unbound_func = func.__func__ ... Given that functions are the most common case, I think it's best to keep with `func` as the main interface, but it's good to know that this does not preclude overriding methods. -------------- next part -------------- An HTML attachment was scrubbed... URL: From m.h.vankerkwijk at gmail.com Thu Jun 28 16:11:07 2018 From: m.h.vankerkwijk at gmail.com (Marten van Kerkwijk) Date: Thu, 28 Jun 2018 16:11:07 -0400 Subject: [Numpy-discussion] Revised NEP-18, __array_function__ protocol In-Reply-To: References: Message-ID: > I did a little more digging, and turned up the __self__ and __func__ > attributes of bound methods: > https://stackoverflow.com/questions/4679592/how-to-find- > instance-of-a-bound-method-in-python > > So we might need another decorator function, but it seems that the current > interface would actually suffice just fine for overriding methods. I'll > update the NEP with some examples. It will look something like: > > def __array_function__(self, func, types, args, kwargs): > ... > if isinstance(func, types.MethodType): > object = func.__self__ > unbound_func = func.__func__ > ... > > For C classes like the ufuncs, it seems `__self__` is defined for methods as well (at least, `np.add.reduce.__self__` gives `np.add`), but not a `__func__`. There is a `__name__` (="reduce"), though, which means that I think one can still retrieve what is needed (obviously, this also means `__array_ufunc__` could have been simpler...) -- Marten -------------- next part -------------- An HTML attachment was scrubbed... URL: From shoyer at gmail.com Thu Jun 28 20:18:28 2018 From: shoyer at gmail.com (Stephan Hoyer) Date: Thu, 28 Jun 2018 17:18:28 -0700 Subject: [Numpy-discussion] Revised NEP-18, __array_function__ protocol In-Reply-To: References: Message-ID: On Thu, Jun 28, 2018 at 1:12 PM Marten van Kerkwijk < m.h.vankerkwijk at gmail.com> wrote: > For C classes like the ufuncs, it seems `__self__` is defined for methods > as well (at least, `np.add.reduce.__self__` gives `np.add`), but not a > `__func__`. There is a `__name__` (="reduce"), though, which means that I > think one can still retrieve what is needed (obviously, this also means > `__array_ufunc__` could have been simpler...) > Good point! I guess this means we should encourage using __name__ rather than __func__. I would not want to preclude refactoring classes from Python to C/Cython. -------------- next part -------------- An HTML attachment was scrubbed... URL: From wieser.eric+numpy at gmail.com Thu Jun 28 20:35:17 2018 From: wieser.eric+numpy at gmail.com (Eric Wieser) Date: Thu, 28 Jun 2018 17:35:17 -0700 Subject: [Numpy-discussion] Revised NEP-18, __array_function__ protocol In-Reply-To: References: Message-ID: Another option would be to directly compare the methods against known ones: obj = func.__self__ if isinstance(obj, np.ufunc): if func is obj.reduce: got_reduction() Eric ? 
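A rough sketch of how a downstream class might use those attributes
inside __array_function__, if bound ufunc methods were ever passed
through as `func` (MyArray and the handler registry here are
hypothetical):

    import numpy as np

    # hypothetical registry, e.g. {(np.add, 'reduce'): my_add_reduce}
    HANDLED_UFUNC_METHODS = {}

    class MyArray:
        def __array_function__(self, func, types, args, kwargs):
            obj = getattr(func, '__self__', None)    # np.add for np.add.reduce
            name = getattr(func, '__name__', None)   # 'reduce' for np.add.reduce
            if isinstance(obj, np.ufunc) and name is not None:
                handler = HANDLED_UFUNC_METHODS.get((obj, name))
                if handler is not None:
                    return handler(*args, **kwargs)
            return NotImplemented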
On Thu, 28 Jun 2018 at 17:19 Stephan Hoyer wrote: > On Thu, Jun 28, 2018 at 1:12 PM Marten van Kerkwijk < > m.h.vankerkwijk at gmail.com> wrote: > >> For C classes like the ufuncs, it seems `__self__` is defined for methods >> as well (at least, `np.add.reduce.__self__` gives `np.add`), but not a >> `__func__`. There is a `__name__` (="reduce"), though, which means that I >> think one can still retrieve what is needed (obviously, this also means >> `__array_ufunc__` could have been simpler...) >> > > Good point! > > I guess this means we should encourage using __name__ rather than > __func__. I would not want to preclude refactoring classes from Python to > C/Cython. > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at python.org > https://mail.python.org/mailman/listinfo/numpy-discussion > -------------- next part -------------- An HTML attachment was scrubbed... URL: From matti.picus at gmail.com Thu Jun 28 20:34:32 2018 From: matti.picus at gmail.com (Matti Picus) Date: Thu, 28 Jun 2018 17:34:32 -0700 Subject: [Numpy-discussion] Revised NEP-18, __array_function__ protocol In-Reply-To: References: Message-ID: On 28/06/18 17:18, Stephan Hoyer wrote: > On Thu, Jun 28, 2018 at 1:12 PM Marten van Kerkwijk > > wrote: > > For C classes like the ufuncs, it seems `__self__` is defined for > methods as well (at least, `np.add.reduce.__self__` gives > `np.add`), but not a `__func__`. There is a `__name__` > (="reduce"), though, which means that I think one can still > retrieve what is needed (obviously, this also means > `__array_ufunc__` could have been simpler...) > > > Good point! > > I guess this means we should encourage using __name__ rather than > __func__. I would not want to preclude refactoring classes from Python > to C/Cython. > > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at python.org > https://mail.python.org/mailman/listinfo/numpy-discussion There was opposition to that in a PR I made to provide a wrapper around matmul to turn it into a ufunc. It would have left the __name__ but changed the __func__. https://github.com/numpy/numpy/pull/11061#issuecomment-387468084 From einstein.edison at gmail.com Thu Jun 28 22:48:14 2018 From: einstein.edison at gmail.com (Hameer Abbasi) Date: Thu, 28 Jun 2018 22:48:14 -0400 Subject: [Numpy-discussion] Revised NEP-18, __array_function__ protocol In-Reply-To: References: Message-ID: Hi Martin, It is. The point of the proposed feature was to handle array generation mechanisms, that don't take an array as input in the standard NumPy API. Giving them a reference handles both the dispatch and the decision about which implementation to call. I'm confused: Isn't your reference array just `self`? -------------- next part -------------- An HTML attachment was scrubbed... URL: From charlesr.harris at gmail.com Fri Jun 29 13:15:59 2018 From: charlesr.harris at gmail.com (Charles R Harris) Date: Fri, 29 Jun 2018 11:15:59 -0600 Subject: [Numpy-discussion] Github down on comcast Message-ID: Hi All, Just a note for those who may be having a problem reaching Github, it is currently down for comcast users. See http://downdetector.com/status/github/map/. Chuck -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From daniele at grinta.net Fri Jun 29 15:30:10 2018 From: daniele at grinta.net (Daniele Nicolodi) Date: Fri, 29 Jun 2018 13:30:10 -0600 Subject: [Numpy-discussion] Github down on comcast In-Reply-To: References: Message-ID: <213a3cec-bfbe-7f6d-8335-3a6a6e34d229@grinta.net> On 6/29/18 11:15 AM, Charles R Harris wrote: > Hi All, > > Just a note for those who may be having a problem reaching Github, it is > currently down for comcast users. > See?http://downdetector.com/status/github/map/. Funny enough http://dowdetector.com seems to not be reachable from this side of the Internet :-) Cheers, Dan From njs at pobox.com Fri Jun 29 18:18:20 2018 From: njs at pobox.com (Nathaniel Smith) Date: Fri, 29 Jun 2018 15:18:20 -0700 Subject: [Numpy-discussion] Proposal to accept NEP 15: Merging multiarray and umath Message-ID: Hi all, I propose that we accept NEP 15: Merging multiarray and umath: http://www.numpy.org/neps/nep-0015-merge-multiarray-umath.html The core part of this proposal was uncontroversial. The main point of discussion was whether it was OK to deprecate set_numeric_ops, or whether it had some legitimate use cases. The conclusion was that in all the cases where set_numeric_ops is useful, PyUFunc_ReplaceLoopBySignature is a strictly better alternative, so there's no reason not to deprecate set_numeric_ops. So at this point I think the whole proposal is uncontroversial, and we can go ahead and accept it. If there are no substantive objections within 7 days from this email, then the NEP will be accepted; see NEP 0 for more details: http://www.numpy.org/neps/nep-0000.html -n -- Nathaniel J. Smith -- https://vorpus.org From njs at pobox.com Fri Jun 29 18:23:05 2018 From: njs at pobox.com (Nathaniel Smith) Date: Fri, 29 Jun 2018 15:23:05 -0700 Subject: [Numpy-discussion] Proposal to accept NEP 15: Merging multiarray and umath In-Reply-To: References: Message-ID: Note that this is the first formal proposal to accept a NEP using our new process (yay!). While writing it I realized that the current text about this in NEP 0 is a bit terse, so I've also just submitted a PR to expand that section: https://github.com/numpy/numpy/pull/11459 -n On Fri, Jun 29, 2018 at 3:18 PM, Nathaniel Smith wrote: > Hi all, > > I propose that we accept NEP 15: Merging multiarray and umath: > > http://www.numpy.org/neps/nep-0015-merge-multiarray-umath.html > > The core part of this proposal was uncontroversial. The main point of > discussion was whether it was OK to deprecate set_numeric_ops, or > whether it had some legitimate use cases. The conclusion was that in > all the cases where set_numeric_ops is useful, > PyUFunc_ReplaceLoopBySignature is a strictly better alternative, so > there's no reason not to deprecate set_numeric_ops. So at this point I > think the whole proposal is uncontroversial, and we can go ahead and > accept it. > > If there are no substantive objections within 7 days from this email, > then the NEP will be accepted; see NEP 0 for more details: > http://www.numpy.org/neps/nep-0000.html > > -n > > -- > Nathaniel J. Smith -- https://vorpus.org -- Nathaniel J. Smith -- https://vorpus.org From m.h.vankerkwijk at gmail.com Fri Jun 29 18:28:03 2018 From: m.h.vankerkwijk at gmail.com (Marten van Kerkwijk) Date: Fri, 29 Jun 2018 18:28:03 -0400 Subject: [Numpy-discussion] Proposal to accept NEP 15: Merging multiarray and umath In-Reply-To: References: Message-ID: Agreed on accepting the NEP! 
But it is not the first proposal to accept under the new rules - that goes to the broadcasting NEP (though perhaps I wasn't sufficiently explicit in stating that I was starting a count-down...). -- Marten -------------- next part -------------- An HTML attachment was scrubbed... URL: From charlesr.harris at gmail.com Fri Jun 29 18:31:06 2018 From: charlesr.harris at gmail.com (Charles R Harris) Date: Fri, 29 Jun 2018 16:31:06 -0600 Subject: [Numpy-discussion] rackspace ssl certificates In-Reply-To: References: Message-ID: On Tue, Jun 26, 2018 at 3:55 PM, Matthew Brett wrote: > Hi, > > On Tue, Jun 26, 2018 at 10:43 PM, Matti Picus > wrote: > > On 19/06/18 10:57, Matthew Brett wrote: > >> > >> Hi, > >> > >> On Tue, Jun 19, 2018 at 6:27 PM, Matti Picus > >> wrote: > >>> > >>> On 19/06/18 09:58, Charles R Harris wrote: > >>>>> > >>>>> What I was curious about is that there were no more "daily" builds of > >>>>> master. > >>>> > >>>> Is that right? That there were daily builds of master, on Appveyor? > >>>> I don't know how those worked, I only recently got cron permission ... > >>> > >>> > >>> No, but there used to be daily builds on travis. They stopped 8 days > ago, > >>> https://travis-ci.org/MacPython/numpy-wheels/builds. > >> > >> Oops - yes - sorry - I retired the 'daily' branch, in favor of > >> 'master', but forgot to update the Travis-CI settings. > >> > >> Done now. > >> > >> Cheers, > >> > >> Matthew > >> > > FWIW, still no daily builds at > > https://travis-ci.org/MacPython/numpy-wheels/builds > > You mean, some days there appears to be no build? The build matrix > does show Cron-triggered jobs, the last of which was a few hours ago: > https://travis-ci.org/MacPython/numpy-wheels/builds/397008012 > > Cheers, > > Matthew > The cron wheels are getting built and tested, but they aren't uploading to rackspace. Chuck -------------- next part -------------- An HTML attachment was scrubbed... URL: From matthew.brett at gmail.com Fri Jun 29 18:35:11 2018 From: matthew.brett at gmail.com (Matthew Brett) Date: Fri, 29 Jun 2018 23:35:11 +0100 Subject: [Numpy-discussion] rackspace ssl certificates In-Reply-To: References: Message-ID: On Fri, Jun 29, 2018 at 11:31 PM, Charles R Harris wrote: > > > On Tue, Jun 26, 2018 at 3:55 PM, Matthew Brett > wrote: >> >> Hi, >> >> On Tue, Jun 26, 2018 at 10:43 PM, Matti Picus >> wrote: >> > On 19/06/18 10:57, Matthew Brett wrote: >> >> >> >> Hi, >> >> >> >> On Tue, Jun 19, 2018 at 6:27 PM, Matti Picus >> >> wrote: >> >>> >> >>> On 19/06/18 09:58, Charles R Harris wrote: >> >>>>> >> >>>>> What I was curious about is that there were no more "daily" builds >> >>>>> of >> >>>>> master. >> >>>> >> >>>> Is that right? That there were daily builds of master, on Appveyor? >> >>>> I don't know how those worked, I only recently got cron permission >> >>>> ... >> >>> >> >>> >> >>> No, but there used to be daily builds on travis. They stopped 8 days >> >>> ago, >> >>> https://travis-ci.org/MacPython/numpy-wheels/builds. >> >> >> >> Oops - yes - sorry - I retired the 'daily' branch, in favor of >> >> 'master', but forgot to update the Travis-CI settings. >> >> >> >> Done now. >> >> >> >> Cheers, >> >> >> >> Matthew >> >> >> > FWIW, still no daily builds at >> > https://travis-ci.org/MacPython/numpy-wheels/builds >> >> You mean, some days there appears to be no build? 
The build matrix >> does show Cron-triggered jobs, the last of which was a few hours ago: >> https://travis-ci.org/MacPython/numpy-wheels/builds/397008012 >> >> Cheers, >> >> Matthew > > > The cron wheels are getting built and tested, but they aren't uploading to > rackspace. The cron wheels go to the "pre" container at https://7933911d6844c6c53a7d-47bd50c35cd79bd838daf386af554a83.ssl.cf2.rackcdn.com Cheers, Matthew From njs at pobox.com Fri Jun 29 19:50:11 2018 From: njs at pobox.com (Nathaniel Smith) Date: Fri, 29 Jun 2018 16:50:11 -0700 Subject: [Numpy-discussion] Proposal to accept NEP 15: Merging multiarray and umath In-Reply-To: References: Message-ID: On Fri, Jun 29, 2018 at 3:28 PM, Marten van Kerkwijk wrote: > Agreed on accepting the NEP! But it is not the first proposal to accept > under the new rules - that goes to the broadcasting NEP (though perhaps I > wasn't sufficiently explicit in stating that I was starting a > count-down...). -- Marten Oh sorry, I missed that! (Which I guess is some evidence in favor of starting a new thread :-).) -n -- Nathaniel J. Smith -- https://vorpus.org From charlesr.harris at gmail.com Fri Jun 29 19:36:53 2018 From: charlesr.harris at gmail.com (Charles R Harris) Date: Fri, 29 Jun 2018 17:36:53 -0600 Subject: [Numpy-discussion] rackspace ssl certificates In-Reply-To: References: Message-ID: On Fri, Jun 29, 2018 at 4:35 PM, Matthew Brett wrote: > On Fri, Jun 29, 2018 at 11:31 PM, Charles R Harris > wrote: > > > > > > On Tue, Jun 26, 2018 at 3:55 PM, Matthew Brett > > wrote: > >> > >> Hi, > >> > >> On Tue, Jun 26, 2018 at 10:43 PM, Matti Picus > >> wrote: > >> > On 19/06/18 10:57, Matthew Brett wrote: > >> >> > >> >> Hi, > >> >> > >> >> On Tue, Jun 19, 2018 at 6:27 PM, Matti Picus > >> >> wrote: > >> >>> > >> >>> On 19/06/18 09:58, Charles R Harris wrote: > >> >>>>> > >> >>>>> What I was curious about is that there were no more "daily" builds > >> >>>>> of > >> >>>>> master. > >> >>>> > >> >>>> Is that right? That there were daily builds of master, on > Appveyor? > >> >>>> I don't know how those worked, I only recently got cron permission > >> >>>> ... > >> >>> > >> >>> > >> >>> No, but there used to be daily builds on travis. They stopped 8 days > >> >>> ago, > >> >>> https://travis-ci.org/MacPython/numpy-wheels/builds. > >> >> > >> >> Oops - yes - sorry - I retired the 'daily' branch, in favor of > >> >> 'master', but forgot to update the Travis-CI settings. > >> >> > >> >> Done now. > >> >> > >> >> Cheers, > >> >> > >> >> Matthew > >> >> > >> > FWIW, still no daily builds at > >> > https://travis-ci.org/MacPython/numpy-wheels/builds > >> > >> You mean, some days there appears to be no build? The build matrix > >> does show Cron-triggered jobs, the last of which was a few hours ago: > >> https://travis-ci.org/MacPython/numpy-wheels/builds/397008012 > >> > >> Cheers, > >> > >> Matthew > > > > > > The cron wheels are getting built and tested, but they aren't uploading > to > > rackspace. > > The cron wheels go to the "pre" container at > https://7933911d6844c6c53a7d-47bd50c35cd79bd838daf386af554a > 83.ssl.cf2.rackcdn.com > > Ah, there they are ... except ... you cancelled the builds I was waiting for :) I was building wheels so we could have folks test the DLL load problem, which I'm pretty sure if fixed anyway, so I suppose waiting on the daily isn't a big a deal. Chuck -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From shoyer at gmail.com Fri Jun 29 21:23:15 2018 From: shoyer at gmail.com (Stephan Hoyer) Date: Fri, 29 Jun 2018 18:23:15 -0700 Subject: [Numpy-discussion] Revised NEP-18, __array_function__ protocol In-Reply-To: References: Message-ID: On Thu, Jun 28, 2018 at 5:36 PM Eric Wieser wrote: > Another option would be to directly compare the methods against known ones: > > obj = func.__self__ > if isinstance(obj, np.ufunc): > if func is obj.reduce: > got_reduction() > > I'm not quite sure why, but this doesn't seem to work with current ufunc objects: >>> np.add.reduce == np.add.reduce # OK True >>> np.add.reduce is np.add.reduce # what?!? False Maybe this is a bug? There's been some somewhat related discussion recently on python-dev: https://mail.python.org/pipermail/python-dev/2018-June/153959.html -------------- next part -------------- An HTML attachment was scrubbed... URL: From wieser.eric+numpy at gmail.com Fri Jun 29 21:54:38 2018 From: wieser.eric+numpy at gmail.com (Eric Wieser) Date: Fri, 29 Jun 2018 18:54:38 -0700 Subject: [Numpy-discussion] Revised NEP-18, __array_function__ protocol In-Reply-To: References: Message-ID: Good catch, I think the latter failing is because np.add.reduce ends up calling np.ufunc.reduce.__get__(np.add), and builtin_function.__get__ doesn?t appear to do any caching. I suppose caching bound methods would just be a waste of time. == would work just fine in my suggestion above, it seems - irrespective of the resolution of the discussion on python-dev. Eric ? On Fri, 29 Jun 2018 at 18:24 Stephan Hoyer wrote: > On Thu, Jun 28, 2018 at 5:36 PM Eric Wieser > wrote: > >> Another option would be to directly compare the methods against known >> ones: >> >> obj = func.__self__ >> if isinstance(obj, np.ufunc): >> if func is obj.reduce: >> got_reduction() >> >> I'm not quite sure why, but this doesn't seem to work with current ufunc > objects: > > >>> np.add.reduce == np.add.reduce # OK > True > > >>> np.add.reduce is np.add.reduce # what?!? > False > > Maybe this is a bug? There's been some somewhat related discussion > recently on python-dev: > https://mail.python.org/pipermail/python-dev/2018-June/153959.html > > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at python.org > https://mail.python.org/mailman/listinfo/numpy-discussion > -------------- next part -------------- An HTML attachment was scrubbed... URL: From maifer at haverford.edu Fri Jun 29 22:21:18 2018 From: maifer at haverford.edu (Maxwell Aifer) Date: Fri, 29 Jun 2018 22:21:18 -0400 Subject: [Numpy-discussion] Polynomial evaluation inconsistencies Message-ID: Hi, I noticed some frustrating inconsistencies in the various ways to evaluate polynomials using numpy. Numpy has three ways of evaluating polynomials (that I know of) and each of them has a different syntax: - numpy.polynomial.polynomial.Polynomial : You define a polynomial by a list of coefficients *in order of increasing degree*, and then use the class?s call() function. - np.polyval : Evaluates a polynomial at a point. *First* argument is the polynomial, or list of coefficients *in order of decreasing degree*, and the *second* argument is the point to evaluate at. - np.polynomial.polynomial.polyval : Also evaluates a polynomial at a point, but has more support for vectorization. *First* argument is the point to evaluate at, and *second* argument the list of coefficients *in order of increasing degree*. 
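For instance, evaluating the same polynomial p(x) = 3 + x + 2*x^2 at
x = 2 looks like this under each convention (each call should give 13):

    import numpy as np
    from numpy.polynomial import polynomial as npoly

    np.polynomial.Polynomial([3, 1, 2])(2)   # increasing degree, then call
    np.polyval([2, 1, 3], 2)                 # decreasing degree, polynomial first
    npoly.polyval(2, [3, 1, 2])              # point first, increasing degree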
Not only the order of arguments is changed between different methods, but the order of the coefficients is reversed as well, leading to puzzling bugs (in my experience). What could be the reason for this madness? As polyval is a shameless ripoff of Matlab?s function of the same name anyway, why not just use matlab?s syntax (polyval([c0, c1, c2...], x)) across the board? ? -------------- next part -------------- An HTML attachment was scrubbed... URL: From charlesr.harris at gmail.com Fri Jun 29 23:10:16 2018 From: charlesr.harris at gmail.com (Charles R Harris) Date: Fri, 29 Jun 2018 21:10:16 -0600 Subject: [Numpy-discussion] Polynomial evaluation inconsistencies In-Reply-To: References: Message-ID: On Fri, Jun 29, 2018 at 8:21 PM, Maxwell Aifer wrote: > Hi, > I noticed some frustrating inconsistencies in the various ways to evaluate > polynomials using numpy. Numpy has three ways of evaluating polynomials > (that I know of) and each of them has a different syntax: > > - > > numpy.polynomial.polynomial.Polynomial > : > You define a polynomial by a list of coefficients *in order of > increasing degree*, and then use the class?s call() function. > - > > np.polyval > : > Evaluates a polynomial at a point. *First* argument is the polynomial, > or list of coefficients *in order of decreasing degree*, and the > *second* argument is the point to evaluate at. > - > > np.polynomial.polynomial.polyval > : > Also evaluates a polynomial at a point, but has more support for > vectorization. *First* argument is the point to evaluate at, and > *second* argument the list of coefficients *in order of increasing > degree*. > > Not only the order of arguments is changed between different methods, but > the order of the coefficients is reversed as well, leading to puzzling bugs > (in my experience). What could be the reason for this madness? As polyval > is a shameless ripoff of Matlab?s function of the same name > anyway, why not > just use matlab?s syntax (polyval([c0, c1, c2...], x)) across the board? > ? > > The polynomial package, with its various basis, deals with series, and especially with the truncated series approximations that are used in numerical work. Series are universally written in increasing order of the degree. The Polynomial class is efficient in a single variable, while the numpy.polynomial.polynomial.polyval function is intended as a building block and can also deal with multivariate polynomials or multidimensional arrays of polynomials, or a mix. See the simple implementation of polyval3d for an example. If you are just dealing with a single variable, use Polynomial, which will also track scaling and offsets for numerical stability and is generally much superior to the simple polyval function from a numerical point of view. As to the ordering of the degrees, learning that the degree matches the index is pretty easy and is a more natural fit for the implementation code, especially as the number of variables increases. I note that Matlab has ones based indexing, so that was really not an option for them. Chuck -------------- next part -------------- An HTML attachment was scrubbed... URL: From wieser.eric+numpy at gmail.com Fri Jun 29 23:23:48 2018 From: wieser.eric+numpy at gmail.com (Eric Wieser) Date: Fri, 29 Jun 2018 20:23:48 -0700 Subject: [Numpy-discussion] Polynomial evaluation inconsistencies In-Reply-To: References: Message-ID: Here's my take on this, but it may not be an accurate summary of the history. 
`np.poly` is part of the original matlab-style API, built around `poly1d` objects. This isn't a great design, because they represent: p(x) = c[0] * x^2 + c[1] * x^1 + c[2] * x^0 For this reason, among others, the `np.polynomial` module was created, starting with a clean slate. The core of this is `np.polynomial.Polynomial`. There, everything uses the convention p(x) = c[0] * x^0 + c[1] * x^1 + c[2] * x^2 It sounds like we might need clearer docs explaining the difference, and pointing users to the more sensible `np.polynomial.Polynomial` Eric On Fri, 29 Jun 2018 at 20:10 Charles R Harris wrote: > On Fri, Jun 29, 2018 at 8:21 PM, Maxwell Aifer > wrote: > >> Hi, >> I noticed some frustrating inconsistencies in the various ways to >> evaluate polynomials using numpy. Numpy has three ways of evaluating >> polynomials (that I know of) and each of them has a different syntax: >> >> - >> >> numpy.polynomial.polynomial.Polynomial >> : >> You define a polynomial by a list of coefficients *in order of >> increasing degree*, and then use the class?s call() function. >> - >> >> np.polyval >> : >> Evaluates a polynomial at a point. *First* argument is the >> polynomial, or list of coefficients *in order of decreasing degree*, >> and the *second* argument is the point to evaluate at. >> - >> >> np.polynomial.polynomial.polyval >> : >> Also evaluates a polynomial at a point, but has more support for >> vectorization. *First* argument is the point to evaluate at, and >> *second* argument the list of coefficients *in order of increasing >> degree*. >> >> Not only the order of arguments is changed between different methods, but >> the order of the coefficients is reversed as well, leading to puzzling bugs >> (in my experience). What could be the reason for this madness? As polyval >> is a shameless ripoff of Matlab?s function of the same name >> anyway, why not >> just use matlab?s syntax (polyval([c0, c1, c2...], x)) across the board? >> ? >> >> > The polynomial package, with its various basis, deals with series, and > especially with the truncated series approximations that are used in > numerical work. Series are universally written in increasing order of the > degree. The Polynomial class is efficient in a single variable, while the > numpy.polynomial.polynomial.polyval function is intended as a building > block and can also deal with multivariate polynomials or multidimensional > arrays of polynomials, or a mix. See the simple implementation of polyval3d > for an example. If you are just dealing with a single variable, use > Polynomial, which will also track scaling and offsets for numerical > stability and is generally much superior to the simple polyval function > from a numerical point of view. > > As to the ordering of the degrees, learning that the degree matches the > index is pretty easy and is a more natural fit for the implementation code, > especially as the number of variables increases. I note that Matlab has > ones based indexing, so that was really not an option for them. > > Chuck > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at python.org > https://mail.python.org/mailman/listinfo/numpy-discussion > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From matthew.brett at gmail.com Sat Jun 30 04:32:10 2018 From: matthew.brett at gmail.com (Matthew Brett) Date: Sat, 30 Jun 2018 09:32:10 +0100 Subject: [Numpy-discussion] rackspace ssl certificates In-Reply-To: References: Message-ID: On Sat, Jun 30, 2018 at 12:36 AM, Charles R Harris wrote: > > > On Fri, Jun 29, 2018 at 4:35 PM, Matthew Brett > wrote: >> >> On Fri, Jun 29, 2018 at 11:31 PM, Charles R Harris >> wrote: >> > >> > >> > On Tue, Jun 26, 2018 at 3:55 PM, Matthew Brett >> > wrote: >> >> >> >> Hi, >> >> >> >> On Tue, Jun 26, 2018 at 10:43 PM, Matti Picus >> >> wrote: >> >> > On 19/06/18 10:57, Matthew Brett wrote: >> >> >> >> >> >> Hi, >> >> >> >> >> >> On Tue, Jun 19, 2018 at 6:27 PM, Matti Picus >> >> >> wrote: >> >> >>> >> >> >>> On 19/06/18 09:58, Charles R Harris wrote: >> >> >>>>> >> >> >>>>> What I was curious about is that there were no more "daily" >> >> >>>>> builds >> >> >>>>> of >> >> >>>>> master. >> >> >>>> >> >> >>>> Is that right? That there were daily builds of master, on >> >> >>>> Appveyor? >> >> >>>> I don't know how those worked, I only recently got cron permission >> >> >>>> ... >> >> >>> >> >> >>> >> >> >>> No, but there used to be daily builds on travis. They stopped 8 >> >> >>> days >> >> >>> ago, >> >> >>> https://travis-ci.org/MacPython/numpy-wheels/builds. >> >> >> >> >> >> Oops - yes - sorry - I retired the 'daily' branch, in favor of >> >> >> 'master', but forgot to update the Travis-CI settings. >> >> >> >> >> >> Done now. >> >> >> >> >> >> Cheers, >> >> >> >> >> >> Matthew >> >> >> >> >> > FWIW, still no daily builds at >> >> > https://travis-ci.org/MacPython/numpy-wheels/builds >> >> >> >> You mean, some days there appears to be no build? The build matrix >> >> does show Cron-triggered jobs, the last of which was a few hours ago: >> >> https://travis-ci.org/MacPython/numpy-wheels/builds/397008012 >> >> >> >> Cheers, >> >> >> >> Matthew >> > >> > >> > The cron wheels are getting built and tested, but they aren't uploading >> > to >> > rackspace. >> >> The cron wheels go to the "pre" container at >> >> https://7933911d6844c6c53a7d-47bd50c35cd79bd838daf386af554a83.ssl.cf2.rackcdn.com >> > > Ah, there they are ... except ... you cancelled the builds I was waiting for > :) I was building wheels so we could have folks test the DLL load problem, > which I'm pretty sure if fixed anyway, so I suppose waiting on the daily > isn't a big a deal. Oh - sorry - I was rushing to get 1.14.5 wheels built. Can you retrigger the builds? Do you want me to? Cheers, Matthew From m.h.vankerkwijk at gmail.com Sat Jun 30 09:51:15 2018 From: m.h.vankerkwijk at gmail.com (Marten van Kerkwijk) Date: Sat, 30 Jun 2018 09:51:15 -0400 Subject: [Numpy-discussion] Fwd: Allowing broadcasting of code dimensions in generalized ufuncs In-Reply-To: References: Message-ID: Hi All, In case it was missed because people have tuned out of the thread: Matti and I proposed last Tuesday to accept NEP 20 (on coming Tuesday, as per NEP 0), which introduces notation for generalized ufuncs allowing fixed, flexible and broadcastable core dimensions. For one thing, this will allow Matti to finish his work on making matmul a gufunc. 
See http://www.numpy.org/neps/nep-0020-gufunc-signature-enhancement.html All the best, Marten ---------- Forwarded message ---------- From: Marten van Kerkwijk Date: Tue, Jun 26, 2018 at 2:25 PM Subject: Re: [Numpy-discussion] Allowing broadcasting of code dimensions in generalized ufuncs To: Discussion of Numerical Python Hi All, Matti asked me to make a PR accepting my own NEP - https://github.com/numpy/numpy/pull/11429 Any objections? As noted in my earlier summary of the discussion, in principle we can choose to accept only parts, although I think it became clear that the most contentious is also the one arguably most needed, the flexible dimensions for matmul. Moving forward has the advantage that in 1.16 we will actually be able to deal with matmul. All the best, Marten -------------- next part -------------- An HTML attachment was scrubbed... URL: From m.h.vankerkwijk at gmail.com Sat Jun 30 09:55:29 2018 From: m.h.vankerkwijk at gmail.com (Marten van Kerkwijk) Date: Sat, 30 Jun 2018 09:55:29 -0400 Subject: [Numpy-discussion] Revised NEP-18, __array_function__ protocol In-Reply-To: References: Message-ID: On Fri, Jun 29, 2018 at 9:54 PM, Eric Wieser wrote: > Good catch, > > I think the latter failing is because np.add.reduce ends up calling > np.ufunc.reduce.__get__(np.add), and builtin_function.__get__ doesn?t > appear to do any caching. I suppose caching bound methods would just be a > waste of time. > == would work just fine in my suggestion above, it seems - irrespective > of the resolution of the discussion on python-dev. > > Eric > ? > I think for implementers it might work easiest anyway to look up the ufunc itself in a dict or so and then check the name of the method. (At least, for my impementations of `__array_ufunc__`, it made a lot of sense to use the method in that way; possibly less so for the larger variety with other numpy functions). -- Marten -------------- next part -------------- An HTML attachment was scrubbed... URL: From charlesr.harris at gmail.com Sat Jun 30 09:57:44 2018 From: charlesr.harris at gmail.com (Charles R Harris) Date: Sat, 30 Jun 2018 07:57:44 -0600 Subject: [Numpy-discussion] rackspace ssl certificates In-Reply-To: References: Message-ID: Not to worry, I'll just wait on the daily. On Sat, Jun 30, 2018 at 2:32 AM, Matthew Brett wrote: > On Sat, Jun 30, 2018 at 12:36 AM, Charles R Harris > wrote: > > > > > > On Fri, Jun 29, 2018 at 4:35 PM, Matthew Brett > > wrote: > >> > >> On Fri, Jun 29, 2018 at 11:31 PM, Charles R Harris > >> wrote: > >> > > >> > > >> > On Tue, Jun 26, 2018 at 3:55 PM, Matthew Brett < > matthew.brett at gmail.com> > >> > wrote: > >> >> > >> >> Hi, > >> >> > >> >> On Tue, Jun 26, 2018 at 10:43 PM, Matti Picus > > >> >> wrote: > >> >> > On 19/06/18 10:57, Matthew Brett wrote: > >> >> >> > >> >> >> Hi, > >> >> >> > >> >> >> On Tue, Jun 19, 2018 at 6:27 PM, Matti Picus < > matti.picus at gmail.com> > >> >> >> wrote: > >> >> >>> > >> >> >>> On 19/06/18 09:58, Charles R Harris wrote: > >> >> >>>>> > >> >> >>>>> What I was curious about is that there were no more "daily" > >> >> >>>>> builds > >> >> >>>>> of > >> >> >>>>> master. > >> >> >>>> > >> >> >>>> Is that right? That there were daily builds of master, on > >> >> >>>> Appveyor? > >> >> >>>> I don't know how those worked, I only recently got cron > permission > >> >> >>>> ... > >> >> >>> > >> >> >>> > >> >> >>> No, but there used to be daily builds on travis. They stopped 8 > >> >> >>> days > >> >> >>> ago, > >> >> >>> https://travis-ci.org/MacPython/numpy-wheels/builds. 
> >> >> >> > >> >> >> Oops - yes - sorry - I retired the 'daily' branch, in favor of > >> >> >> 'master', but forgot to update the Travis-CI settings. > >> >> >> > >> >> >> Done now. > >> >> >> > >> >> >> Cheers, > >> >> >> > >> >> >> Matthew > >> >> >> > >> >> > FWIW, still no daily builds at > >> >> > https://travis-ci.org/MacPython/numpy-wheels/builds > >> >> > >> >> You mean, some days there appears to be no build? The build matrix > >> >> does show Cron-triggered jobs, the last of which was a few hours ago: > >> >> https://travis-ci.org/MacPython/numpy-wheels/builds/397008012 > >> >> > >> >> Cheers, > >> >> > >> >> Matthew > >> > > >> > > >> > The cron wheels are getting built and tested, but they aren't > uploading > >> > to > >> > rackspace. > >> > >> The cron wheels go to the "pre" container at > >> > >> https://7933911d6844c6c53a7d-47bd50c35cd79bd838daf386af554a > 83.ssl.cf2.rackcdn.com > >> > > > > Ah, there they are ... except ... you cancelled the builds I was waiting > for > > :) I was building wheels so we could have folks test the DLL load > problem, > > which I'm pretty sure if fixed anyway, so I suppose waiting on the daily > > isn't a big a deal. > > Oh - sorry - I was rushing to get 1.14.5 wheels built. Can you > retrigger the builds? Do you want me to? > > Cheers, > > Matthew > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at python.org > https://mail.python.org/mailman/listinfo/numpy-discussion > -------------- next part -------------- An HTML attachment was scrubbed... URL: From m.h.vankerkwijk at gmail.com Sat Jun 30 10:02:52 2018 From: m.h.vankerkwijk at gmail.com (Marten van Kerkwijk) Date: Sat, 30 Jun 2018 10:02:52 -0400 Subject: [Numpy-discussion] Revised NEP-18, __array_function__ protocol In-Reply-To: References: Message-ID: Hi Hameer, It is. The point of the proposed feature was to handle array generation > mechanisms, that don't take an array as input in the standard NumPy API. > Giving them a reference handles both the dispatch and the decision about > which implementation to call. > Sorry, I had clearly misunderstood. It would indeed be nice for overrides to work on functions like `zeros` or `arange` as well, but it seems strange to change the signature just for that. As a possible alternative, should we perhaps generally check for overrides on `dtype`? All the best, Marten -------------- next part -------------- An HTML attachment was scrubbed... URL: From einstein.edison at gmail.com Sat Jun 30 10:40:29 2018 From: einstein.edison at gmail.com (Hameer Abbasi) Date: Sat, 30 Jun 2018 07:40:29 -0700 Subject: [Numpy-discussion] Revised NEP-18, __array_function__ protocol In-Reply-To: References: Message-ID: Hi Marten, Sorry, I had clearly misunderstood. It would indeed be nice for overrides to work on functions like `zeros` or `arange` as well, but it seems strange to change the signature just for that. As a possible alternative, should we perhaps generally check for overrides on `dtype`? While this very clearly makes sense for something like astropy, it has a few drawbacks: - Other duck arrays such as Dask need more information than just the dtype. For example, Dask needs chunk sizes, XArray needs axis labels, and pydata/sparse needs to know the type of the reference array in order to make one of the same type. The information in a reference array is a strict superset of information in the dtype. 
- There?s a need for a separate protocol, which might be a lot harder to work with for both NumPy and library authors. - Some things, like numpy.random.RandomState, don?t accept a dtype argument. As for your concern about changing the signature, it?s easy enough with a decorator. We?ll need a separate decorator for array generation functions. Something like: def array_generation_function(func): @functools.wraps(func) def wrapped(*args, **kwargs, array_reference=np._NoValue): if array_reference is not np._NoValue: success, result = try_array_function_override(wrapped, [array_reference], args, kwargs) if success: return result return func(*args, **kwargs) return wrapped Hameer Abbasi -------------- next part -------------- An HTML attachment was scrubbed... URL: From maifer at haverford.edu Sat Jun 30 12:13:58 2018 From: maifer at haverford.edu (Maxwell Aifer) Date: Sat, 30 Jun 2018 12:13:58 -0400 Subject: [Numpy-discussion] Polynomial evaluation inconsistencies In-Reply-To: References: Message-ID: Thanks, that explains a lot! I didn't realize the reverse ordering actually originated with matlab's polyval, but that makes sense given the one-based indexing. I see why it is the way it is, but I still think it would make more sense for np.polyval() to use conventional indexing (c[0] * x^0 + c[1] * x^1 + c[2] * x^2). np.polyval() can be convenient when a polynomial object is just not needed, but if a single program uses both np.polyval() and np.polynomail.Polynomial, it seems bound to cause unnecessary confusion. Max On Fri, Jun 29, 2018 at 11:23 PM, Eric Wieser wrote: > Here's my take on this, but it may not be an accurate summary of the > history. > > `np.poly` is part of the original matlab-style API, built around > `poly1d` objects. This isn't a great design, because they represent: > > p(x) = c[0] * x^2 + c[1] * x^1 + c[2] * x^0 > > For this reason, among others, the `np.polynomial` module was created, > starting with a clean slate. The core of this is > `np.polynomial.Polynomial`. There, everything uses the convention > > p(x) = c[0] * x^0 + c[1] * x^1 + c[2] * x^2 > > It sounds like we might need clearer docs explaining the difference, and > pointing users to the more sensible `np.polynomial.Polynomial` > > Eric > > > > On Fri, 29 Jun 2018 at 20:10 Charles R Harris > wrote: > >> On Fri, Jun 29, 2018 at 8:21 PM, Maxwell Aifer >> wrote: >> >>> Hi, >>> I noticed some frustrating inconsistencies in the various ways to >>> evaluate polynomials using numpy. Numpy has three ways of evaluating >>> polynomials (that I know of) and each of them has a different syntax: >>> >>> - >>> >>> numpy.polynomial.polynomial.Polynomial >>> : >>> You define a polynomial by a list of coefficients *in order of >>> increasing degree*, and then use the class?s call() function. >>> - >>> >>> np.polyval >>> : >>> Evaluates a polynomial at a point. *First* argument is the >>> polynomial, or list of coefficients *in order of decreasing degree*, >>> and the *second* argument is the point to evaluate at. >>> - >>> >>> np.polynomial.polynomial.polyval >>> : >>> Also evaluates a polynomial at a point, but has more support for >>> vectorization. *First* argument is the point to evaluate at, and >>> *second* argument the list of coefficients *in order of increasing >>> degree*. >>> >>> Not only the order of arguments is changed between different methods, >>> but the order of the coefficients is reversed as well, leading to puzzling >>> bugs (in my experience). What could be the reason for this madness? 
As >>> polyval is a shameless ripoff of Matlab?s function of the same name >>> anyway, why >>> not just use matlab?s syntax (polyval([c0, c1, c2...], x)) across the >>> board? >>> ? >>> >>> >> The polynomial package, with its various basis, deals with series, and >> especially with the truncated series approximations that are used in >> numerical work. Series are universally written in increasing order of the >> degree. The Polynomial class is efficient in a single variable, while the >> numpy.polynomial.polynomial.polyval function is intended as a building >> block and can also deal with multivariate polynomials or multidimensional >> arrays of polynomials, or a mix. See the simple implementation of polyval3d >> for an example. If you are just dealing with a single variable, use >> Polynomial, which will also track scaling and offsets for numerical >> stability and is generally much superior to the simple polyval function >> from a numerical point of view. >> >> As to the ordering of the degrees, learning that the degree matches the >> index is pretty easy and is a more natural fit for the implementation code, >> especially as the number of variables increases. I note that Matlab has >> ones based indexing, so that was really not an option for them. >> >> Chuck >> _______________________________________________ >> NumPy-Discussion mailing list >> NumPy-Discussion at python.org >> https://mail.python.org/mailman/listinfo/numpy-discussion >> > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at python.org > https://mail.python.org/mailman/listinfo/numpy-discussion > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From m.h.vankerkwijk at gmail.com Sat Jun 30 12:52:19 2018 From: m.h.vankerkwijk at gmail.com (Marten van Kerkwijk) Date: Sat, 30 Jun 2018 12:52:19 -0400 Subject: [Numpy-discussion] Revised NEP-18, __array_function__ protocol In-Reply-To: References: Message-ID: Hi Hameer, I think the override on `dtype` would work - after all, the override is checked before anything is done, so one can just pass in `self` if one wishes (or some helper class that contains both `self` and any desired further information. But, as you note, it would not cover everything, and your `array_reference` idea definitely makes things more uniform. Indeed, it would allow one to implement things like `np.zeros_like` using `np.zero`, which seems quite nice. Still, I'm not sure whether this should be included in the present NEP or is best done separately after, with a few concrete examples of where it would be useful. All the best, Marten On Sat, Jun 30, 2018 at 10:40 AM, Hameer Abbasi wrote: > Hi Marten, > > Sorry, I had clearly misunderstood. It would indeed be nice for overrides > to work on functions like `zeros` or `arange` as well, but it seems strange > to change the signature just for that. As a possible alternative, should we > perhaps generally check for overrides on `dtype`? > > > While this very clearly makes sense for something like astropy, it has a > few drawbacks: > > - Other duck arrays such as Dask need more information than just the > dtype. For example, Dask needs chunk sizes, XArray needs axis labels, and > pydata/sparse needs to know the type of the reference array in order > to make one of the same type. The information in a reference array is a > strict superset of information in the dtype. 
> - There?s a need for a separate protocol, which might be a lot harder > to work with for both NumPy and library authors. > - Some things, like numpy.random.RandomState, don?t accept a dtype > argument. > > As for your concern about changing the signature, it?s easy enough with a > decorator. We?ll need a separate decorator for array generation functions. > Something like: > > def array_generation_function(func): > @functools.wraps(func) > def wrapped(*args, **kwargs, array_reference=np._NoValue): > if array_reference is not np._NoValue: > success, result = try_array_function_override(wrapped, [array_reference], args, kwargs) > > if success: > return result > > return func(*args, **kwargs) > > return wrapped > > Hameer Abbasi > > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at python.org > https://mail.python.org/mailman/listinfo/numpy-discussion > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From wieser.eric+numpy at gmail.com Sat Jun 30 14:09:56 2018 From: wieser.eric+numpy at gmail.com (Eric Wieser) Date: Sat, 30 Jun 2018 11:09:56 -0700 Subject: [Numpy-discussion] Polynomial evaluation inconsistencies In-Reply-To: References: Message-ID: > if a single program uses both np.polyval() and np.polynomail.Polynomial, it seems bound to cause unnecessary confusion. Yes, I would recommend definitely not doing that! > I still think it would make more sense for np.polyval() to use conventional indexing Unfortunately, it's too late for "making sense" to factor into the design. `polyval` is being used in the wild, so we're stuck with it behaving the way it does. At best, we can deprecate it and start telling people to move from `np.polyval` over to `np.polynomial.polynomial.polyval`. Perhaps we need to make this namespace less cumbersome in order for that to be a reasonable option. I also wonder if we want a more lightweight polynomial object without the extra domain and range information, which seem like they make `Polynomial` a more questionable drop-in replacement for `poly1d`. Eric On Sat, 30 Jun 2018 at 09:14 Maxwell Aifer wrote: > Thanks, that explains a lot! I didn't realize the reverse ordering > actually originated with matlab's polyval, but that makes sense given the > one-based indexing. I see why it is the way it is, but I still think it > would make more sense for np.polyval() to use conventional indexing (c[0] > * x^0 + c[1] * x^1 + c[2] * x^2). np.polyval() can be convenient when a > polynomial object is just not needed, but if a single program uses both > np.polyval() and np.polynomail.Polynomial, it seems bound to cause > unnecessary confusion. > > Max > > On Fri, Jun 29, 2018 at 11:23 PM, Eric Wieser > wrote: > >> Here's my take on this, but it may not be an accurate summary of the >> history. >> >> `np.poly` is part of the original matlab-style API, built around >> `poly1d` objects. This isn't a great design, because they represent: >> >> p(x) = c[0] * x^2 + c[1] * x^1 + c[2] * x^0 >> >> For this reason, among others, the `np.polynomial` module was created, >> starting with a clean slate. The core of this is >> `np.polynomial.Polynomial`. 
There, everything uses the convention >> >> p(x) = c[0] * x^0 + c[1] * x^1 + c[2] * x^2 >> >> It sounds like we might need clearer docs explaining the difference, and >> pointing users to the more sensible `np.polynomial.Polynomial` >> >> Eric >> >> >> >> On Fri, 29 Jun 2018 at 20:10 Charles R Harris >> wrote: >> >>> On Fri, Jun 29, 2018 at 8:21 PM, Maxwell Aifer >>> wrote: >>> >>>> Hi, >>>> I noticed some frustrating inconsistencies in the various ways to >>>> evaluate polynomials using numpy. Numpy has three ways of evaluating >>>> polynomials (that I know of) and each of them has a different syntax: >>>> >>>> - >>>> >>>> numpy.polynomial.polynomial.Polynomial >>>> : >>>> You define a polynomial by a list of coefficients *in order of >>>> increasing degree*, and then use the class?s call() function. >>>> - >>>> >>>> np.polyval >>>> : >>>> Evaluates a polynomial at a point. *First* argument is the >>>> polynomial, or list of coefficients *in order of decreasing degree*, >>>> and the *second* argument is the point to evaluate at. >>>> - >>>> >>>> np.polynomial.polynomial.polyval >>>> : >>>> Also evaluates a polynomial at a point, but has more support for >>>> vectorization. *First* argument is the point to evaluate at, and >>>> *second* argument the list of coefficients *in order of increasing >>>> degree*. >>>> >>>> Not only the order of arguments is changed between different methods, >>>> but the order of the coefficients is reversed as well, leading to puzzling >>>> bugs (in my experience). What could be the reason for this madness? As >>>> polyval is a shameless ripoff of Matlab?s function of the same name >>>> anyway, why >>>> not just use matlab?s syntax (polyval([c0, c1, c2...], x)) across the >>>> board? >>>> ? >>>> >>>> >>> The polynomial package, with its various basis, deals with series, and >>> especially with the truncated series approximations that are used in >>> numerical work. Series are universally written in increasing order of the >>> degree. The Polynomial class is efficient in a single variable, while the >>> numpy.polynomial.polynomial.polyval function is intended as a building >>> block and can also deal with multivariate polynomials or multidimensional >>> arrays of polynomials, or a mix. See the simple implementation of polyval3d >>> for an example. If you are just dealing with a single variable, use >>> Polynomial, which will also track scaling and offsets for numerical >>> stability and is generally much superior to the simple polyval function >>> from a numerical point of view. >>> >>> As to the ordering of the degrees, learning that the degree matches the >>> index is pretty easy and is a more natural fit for the implementation code, >>> especially as the number of variables increases. I note that Matlab has >>> ones based indexing, so that was really not an option for them. >>> >>> Chuck >>> _______________________________________________ >>> NumPy-Discussion mailing list >>> NumPy-Discussion at python.org >>> https://mail.python.org/mailman/listinfo/numpy-discussion >>> >> >> _______________________________________________ >> NumPy-Discussion mailing list >> NumPy-Discussion at python.org >> https://mail.python.org/mailman/listinfo/numpy-discussion >> >> > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at python.org > https://mail.python.org/mailman/listinfo/numpy-discussion > -------------- next part -------------- An HTML attachment was scrubbed... 
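A compact side-by-side of the two coefficient conventions under discussion, using only the public NumPy API (the same cubic, 2x^3 + 3x^2 + 1, written both ways):

    import numpy as np
    from numpy.polynomial import Polynomial

    # Old matlab-style API: coefficients in decreasing order of degree.
    np.polyval([2, 3, 0, 1], 2)                         # 2*8 + 3*4 + 0 + 1 -> 29

    # numpy.polynomial API: coefficients in increasing order of degree,
    # and the argument order is swapped (point first, coefficients second).
    np.polynomial.polynomial.polyval(2, [1, 0, 3, 2])   # -> 29.0

    # The Polynomial class follows the increasing-order convention as well.
    Polynomial([1, 0, 3, 2])(2)                          # -> 29.0

Mixing the two styles in a single program is exactly where the reversed-order bugs described above tend to come from.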
URL: From charlesr.harris at gmail.com Sat Jun 30 14:30:18 2018 From: charlesr.harris at gmail.com (Charles R Harris) Date: Sat, 30 Jun 2018 12:30:18 -0600 Subject: [Numpy-discussion] Polynomial evaluation inconsistencies In-Reply-To: References: Message-ID: On Sat, Jun 30, 2018 at 12:09 PM, Eric Wieser wrote: > > if a single program uses both np.polyval() and > np.polynomail.Polynomial, it seems bound to cause unnecessary confusion. > > Yes, I would recommend definitely not doing that! > > > I still think it would make more sense for np.polyval() to use > conventional indexing > > Unfortunately, it's too late for "making sense" to factor into the design. > `polyval` is being used in the wild, so we're stuck with it behaving the > way it does. At best, we can deprecate it and start telling people to move > from `np.polyval` over to `np.polynomial.polynomial.polyval`. Perhaps we > need to make this namespace less cumbersome in order for that to be a > reasonable option. > > I also wonder if we want a more lightweight polynomial object without the > extra domain and range information, which seem like they make `Polynomial` > a more questionable drop-in replacement for `poly1d`. > The defaults for domain and window make it like a regular polynomial. For fitting, it does adjust the range, but the usual form can be recovered with `p.convert()` and will usually have more accurate coefficients due to using a better conditioned matrix during the fit. In [1]: from numpy.polynomial import Polynomial as P In [2]: p = P([1, 2, 3], domain=(0,2)) In [3]: p(0) Out[3]: 2.0 In [4]: p.convert() Out[4]: Polynomial([ 2., -4., 3.], domain=[-1., 1.], window=[-1., 1.]) In [5]: p.convert()(0) Out[5]: 2.0 Chuck -------------- next part -------------- An HTML attachment was scrubbed... URL: From einstein.edison at gmail.com Sat Jun 30 14:30:21 2018 From: einstein.edison at gmail.com (Hameer Abbasi) Date: Sat, 30 Jun 2018 11:30:21 -0700 Subject: [Numpy-discussion] Revised NEP-18, __array_function__ protocol In-Reply-To: References: Message-ID: Hi Marten, Still, I'm not sure whether this should be included in the present NEP or is best done separately after, with a few concrete examples of where it would be useful. There already are concrete examples from Dask and CuPy, and this is currently a blocker for them, which is part of the reason I?m pushing so hard for it. See #11074 for a context, and I think it was part of the reason that inspired Matt and Stephan to write this protocol in the first place. Best Regards, Hameer Abbasi -------------- next part -------------- An HTML attachment was scrubbed... URL: From shoyer at gmail.com Sat Jun 30 15:13:11 2018 From: shoyer at gmail.com (Stephan Hoyer) Date: Sat, 30 Jun 2018 12:13:11 -0700 Subject: [Numpy-discussion] Revised NEP-18, __array_function__ protocol In-Reply-To: References: Message-ID: On Sat, Jun 30, 2018 at 11:59 AM Hameer Abbasi wrote: > Hi Marten, > > Still, I'm not sure whether this should be included in the present NEP or > is best done separately after, with a few concrete examples of where it > would be useful. > > > There already are concrete examples from Dask and CuPy, and this is > currently a blocker for them, which is part of the reason I?m pushing so > hard for it. See #11074 for > a context, and I think it was part of the reason that inspired Matt and > Stephan to write this protocol in the first place. > Overloading np.ones_like() is definitely in scope already. 
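For concreteness, a rough sketch of what handling np.ones_like could look like on the duck-array side under the proposed protocol. The __array_function__ signature and the NotImplemented convention are taken from the NEP draft; the MyDuckArray class and its internals are made up purely for illustration:

    import numpy as np

    HANDLED_FUNCTIONS = {}

    class MyDuckArray:
        def __init__(self, shape, dtype=float):
            self.shape = shape
            self.dtype = np.dtype(dtype)

        def __array_function__(self, func, types, args, kwargs):
            # Only handle calls whose argument types we understand.
            if not all(issubclass(t, (MyDuckArray, np.ndarray)) for t in types):
                return NotImplemented
            if func not in HANDLED_FUNCTIONS:
                return NotImplemented
            return HANDLED_FUNCTIONS[func](*args, **kwargs)

    def _ones_like(a, dtype=None, **kwargs):
        # A real library would allocate and fill its own storage here;
        # this sketch just returns a new duck array with the same shape.
        return MyDuckArray(a.shape, dtype=dtype or a.dtype)

    HANDLED_FUNCTIONS[np.ones_like] = _ones_like

Once the dispatch machinery lands in NumPy itself, np.ones_like(my_duck_array) would then come back as a MyDuckArray rather than an ndarray.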
I?d love to see a generic way of doing random number generation, but I agree with Martin that I don?t see it fitting a naturally into this NEP. An invasive change to add an array_reference argument to a bunch of functions might indeed be worthy of its own NEP, but again I?m not convinced that?s actually the right approach. I?d rather add a few new functions like random_like, which is a small enough change that concensus on the list might be enough. -------------- next part -------------- An HTML attachment was scrubbed... URL: From ilhanpolat at gmail.com Sat Jun 30 15:08:17 2018 From: ilhanpolat at gmail.com (Ilhan Polat) Date: Sat, 30 Jun 2018 21:08:17 +0200 Subject: [Numpy-discussion] Polynomial evaluation inconsistencies In-Reply-To: References: Message-ID: I think restricting polynomials to time series is not a generic way and quite specific. Apart from the series and certain filter design actual usage of polynomials are always presented with decreasing order (control and signal processing included because they use powers of s and inverse powers of z if needed). So if that is the use case then probably it should go under a namespace of `TimeSeries` or at least require an option to present it in reverse. In my opinion polynomials are way more general than that domain and to everyone else it seems to me that "the intuitive way" is the decreasing powers. For the design > This isn't a great design, because they represent: > p(x) = c[0] * x^2 + c[1] * x^1 + c[2] * x^0 I don't see the problem actually. If I ask someone to write down the coefficients of a polynomial I don't think anyone would start from c[2]. On Sat, Jun 30, 2018 at 8:30 PM, Charles R Harris wrote: > > > On Sat, Jun 30, 2018 at 12:09 PM, Eric Wieser > wrote: > >> > if a single program uses both np.polyval() and >> np.polynomail.Polynomial, it seems bound to cause unnecessary confusion. >> >> Yes, I would recommend definitely not doing that! >> >> > I still think it would make more sense for np.polyval() to use >> conventional indexing >> >> Unfortunately, it's too late for "making sense" to factor into the >> design. `polyval` is being used in the wild, so we're stuck with it >> behaving the way it does. At best, we can deprecate it and start telling >> people to move from `np.polyval` over to `np.polynomial.polynomial.polyval`. >> Perhaps we need to make this namespace less cumbersome in order for that to >> be a reasonable option. >> >> I also wonder if we want a more lightweight polynomial object without the >> extra domain and range information, which seem like they make `Polynomial` >> a more questionable drop-in replacement for `poly1d`. >> > > The defaults for domain and window make it like a regular polynomial. For > fitting, it does adjust the range, but the usual form can be recovered with > `p.convert()` and will usually have more accurate coefficients due to using > a better conditioned matrix during the fit. > > In [1]: from numpy.polynomial import Polynomial as P > > In [2]: p = P([1, 2, 3], domain=(0,2)) > > In [3]: p(0) > Out[3]: 2.0 > > In [4]: p.convert() > Out[4]: Polynomial([ 2., -4., 3.], domain=[-1., 1.], window=[-1., 1.]) > > In [5]: p.convert()(0) > Out[5]: 2.0 > > Chuck > > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at python.org > https://mail.python.org/mailman/listinfo/numpy-discussion > > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From charlesr.harris at gmail.com Sat Jun 30 16:56:14 2018 From: charlesr.harris at gmail.com (Charles R Harris) Date: Sat, 30 Jun 2018 14:56:14 -0600 Subject: [Numpy-discussion] Polynomial evaluation inconsistencies In-Reply-To: References: Message-ID: On Sat, Jun 30, 2018 at 1:08 PM, Ilhan Polat wrote: > I think restricting polynomials to time series is not a generic way and > quite specific. > I think more of complex analysis and it's use of series. > Apart from the series and certain filter design actual usage of > polynomials are always presented with decreasing order (control and signal > processing included because they use powers of s and inverse powers of z if > needed). So if that is the use case then probably it should go under a > namespace of `TimeSeries` or at least require an option to present it in > reverse. In my opinion polynomials are way more general than that domain > and to everyone else it seems to me that "the intuitive way" is the > decreasing powers. > > In approximation, say by Chebyshev polynomials, the coefficients will typically drop off sharply above a certain degree. This has two effects, first, the coefficients that one really cares about are of low degree and should come first, and second, one can truncate the coefficients easily with c[:n]. So in this usage ordering by increasing degree is natural. This is the series idea, fundamental to analysis. Algebraically, interest centers on the degree of the polynomial, which determines the number of zeros and general shape, consequently from the point of view of the algebraist, working with polynomials of finite predetermined degree, arranging the coefficients in order of decreasing degree makes sense and is traditional. That said, I am not actually sure where the high to low ordering of polynomials came from. It could even be like the Arabic numeral system, which when read properly from right to left, has its terms arranged from small to greater. It may even be that the polynomial convention derives that of the Arabic numerals. Chuck -------------- next part -------------- An HTML attachment was scrubbed... URL: From wieser.eric+numpy at gmail.com Sat Jun 30 17:30:03 2018 From: wieser.eric+numpy at gmail.com (Eric Wieser) Date: Sat, 30 Jun 2018 14:30:03 -0700 Subject: [Numpy-discussion] Polynomial evaluation inconsistencies In-Reply-To: References: Message-ID: ?the intuitive way? is the decreasing powers. An argument against this is that accessing the ith power of x is spelt: - x.coeffs[i] for increasing powers - x.coeffs[-i-1] for decreasing powers The former is far more natural than the latter, and avoids a potential off-by-one error If I ask someone to write down the coefficients of a polynomial I don?t think anyone would start from c[2] You wouldn?t? I?d expect to see [image: f(x) = a_3x^3 + a_2x^2 + a_1x + a_0] rather than [image: f(x) = a_0x^3 + a_1x^2 + a_2x + a_3] Sure, I?d write it starting with the highest power, but I?d still number my coefficients to match the powers. Eric ? -------------- next part -------------- An HTML attachment was scrubbed... URL: From maifer at haverford.edu Sat Jun 30 17:33:22 2018 From: maifer at haverford.edu (Maxwell Aifer) Date: Sat, 30 Jun 2018 17:33:22 -0400 Subject: [Numpy-discussion] Polynomial evaluation inconsistencies In-Reply-To: References: Message-ID: Interesting, I wasn't aware that both conventions were widely used. Speaking of series with inverse powers (i.e. 
Laurent series), I wonder how useful it would be to create a class to represent expressions with integral powers from -m to n. These come up in my work sometimes, and I usually represent them with coefficient arrays ordered like this: c[0]*x^0 + ... + c[n]*x^n + c[n+1]x^-m + ... + c[n+m+1]*x^-1 Because then with negative indexing you have: c[-m]*x^-m + ... + c[n]*x^n Still, these objects can't be manipulated as nicely as polynomials because they aren't closed under integration and differentiation (you get log terms). Max On Sat, Jun 30, 2018 at 4:56 PM, Charles R Harris wrote: > > > On Sat, Jun 30, 2018 at 1:08 PM, Ilhan Polat wrote: > >> I think restricting polynomials to time series is not a generic way and >> quite specific. >> > > I think more of complex analysis and it's use of series. > > >> Apart from the series and certain filter design actual usage of >> polynomials are always presented with decreasing order (control and signal >> processing included because they use powers of s and inverse powers of z if >> needed). So if that is the use case then probably it should go under a >> namespace of `TimeSeries` or at least require an option to present it in >> reverse. In my opinion polynomials are way more general than that domain >> and to everyone else it seems to me that "the intuitive way" is the >> decreasing powers. >> >> > In approximation, say by Chebyshev polynomials, the coefficients will > typically drop off sharply above a certain degree. This has two effects, > first, the coefficients that one really cares about are of low degree and > should come first, and second, one can truncate the coefficients easily > with c[:n]. So in this usage ordering by increasing degree is natural. This > is the series idea, fundamental to analysis. > > Algebraically, interest centers on the degree of the polynomial, which > determines the number of zeros and general shape, consequently from the > point of view of the algebraist, working with polynomials of finite > predetermined degree, arranging the coefficients in order of decreasing > degree makes sense and is traditional. > > That said, I am not actually sure where the high to low ordering of > polynomials came from. It could even be like the Arabic numeral system, > which when read properly from right to left, has its terms arranged from > small to greater. It may even be that the polynomial convention derives > that of the Arabic numerals. > > > > Chuck > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at python.org > https://mail.python.org/mailman/listinfo/numpy-discussion > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From wieser.eric+numpy at gmail.com Sat Jun 30 17:41:28 2018 From: wieser.eric+numpy at gmail.com (Eric Wieser) Date: Sat, 30 Jun 2018 14:41:28 -0700 Subject: [Numpy-discussion] Polynomial evaluation inconsistencies In-Reply-To: References: Message-ID: Since the one of the arguments for the decreasing order seems to just be textual representation - do we want to tweak the repr to something like Polynomial(lambda x: 2*x**3 + 3*x**2 + x + 0) (And add a constructor that calls the lambda with Polynomial(1)) Eric ? On Sat, 30 Jun 2018 at 14:30 Eric Wieser wrote: > ?the intuitive way? is the decreasing powers. 
> > An argument against this is that accessing the ith power of x is spelt: > > - x.coeffs[i] for increasing powers > - x.coeffs[-i-1] for decreasing powers > > The former is far more natural than the latter, and avoids a potential > off-by-one error > > If I ask someone to write down the coefficients of a polynomial I don?t > think anyone would start from c[2] > > You wouldn?t? I?d expect to see > > [image: f(x) = a_3x^3 + a_2x^2 + a_1x + a_0] > > rather than > > [image: f(x) = a_0x^3 + a_1x^2 + a_2x + a_3] > > Sure, I?d write it starting with the highest power, but I?d still number > my coefficients to match the powers. > > > Eric > ? > -------------- next part -------------- An HTML attachment was scrubbed... URL: From maifer at haverford.edu Sat Jun 30 18:05:12 2018 From: maifer at haverford.edu (Maxwell Aifer) Date: Sat, 30 Jun 2018 18:05:12 -0400 Subject: [Numpy-discussion] Polynomial evaluation inconsistencies In-Reply-To: References: Message-ID: Oh, clever... yeah I think that would be very cool. But shouldn't it call the constructor with Polynomial([0,1])? On Sat, Jun 30, 2018 at 5:41 PM, Eric Wieser wrote: > Since the one of the arguments for the decreasing order seems to just be > textual representation - do we want to tweak the repr to something like > > Polynomial(lambda x: 2*x**3 + 3*x**2 + x + 0) > > (And add a constructor that calls the lambda with Polynomial(1)) > > Eric > ? > > On Sat, 30 Jun 2018 at 14:30 Eric Wieser > wrote: > >> ?the intuitive way? is the decreasing powers. >> >> An argument against this is that accessing the ith power of x is spelt: >> >> - x.coeffs[i] for increasing powers >> - x.coeffs[-i-1] for decreasing powers >> >> The former is far more natural than the latter, and avoids a potential >> off-by-one error >> >> If I ask someone to write down the coefficients of a polynomial I don?t >> think anyone would start from c[2] >> >> You wouldn?t? I?d expect to see >> >> [image: f(x) = a_3x^3 + a_2x^2 + a_1x + a_0] >> >> rather than >> >> [image: f(x) = a_0x^3 + a_1x^2 + a_2x + a_3] >> >> Sure, I?d write it starting with the highest power, but I?d still number >> my coefficients to match the powers. >> >> >> Eric >> ? >> > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at python.org > https://mail.python.org/mailman/listinfo/numpy-discussion > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From maifer at haverford.edu Sat Jun 30 18:06:46 2018 From: maifer at haverford.edu (Maxwell Aifer) Date: Sat, 30 Jun 2018 18:06:46 -0400 Subject: [Numpy-discussion] Polynomial evaluation inconsistencies In-Reply-To: References: Message-ID: *shouldn't the constructor call the lambda with Polynomial([0,1[) On Sat, Jun 30, 2018 at 6:05 PM, Maxwell Aifer wrote: > Oh, clever... yeah I think that would be very cool. But shouldn't it call > the constructor with Polynomial([0,1])? > > On Sat, Jun 30, 2018 at 5:41 PM, Eric Wieser > wrote: > >> Since the one of the arguments for the decreasing order seems to just be >> textual representation - do we want to tweak the repr to something like >> >> Polynomial(lambda x: 2*x**3 + 3*x**2 + x + 0) >> >> (And add a constructor that calls the lambda with Polynomial(1)) >> >> Eric >> ? >> >> On Sat, 30 Jun 2018 at 14:30 Eric Wieser >> wrote: >> >>> ?the intuitive way? is the decreasing powers. 
>>> >>> An argument against this is that accessing the ith power of x is spelt: >>> >>> - x.coeffs[i] for increasing powers >>> - x.coeffs[-i-1] for decreasing powers >>> >>> The former is far more natural than the latter, and avoids a potential >>> off-by-one error >>> >>> If I ask someone to write down the coefficients of a polynomial I don?t >>> think anyone would start from c[2] >>> >>> You wouldn?t? I?d expect to see >>> >>> [image: f(x) = a_3x^3 + a_2x^2 + a_1x + a_0] >>> >>> rather than >>> >>> [image: f(x) = a_0x^3 + a_1x^2 + a_2x + a_3] >>> >>> Sure, I?d write it starting with the highest power, but I?d still number >>> my coefficients to match the powers. >>> >>> >>> Eric >>> ? >>> >> >> _______________________________________________ >> NumPy-Discussion mailing list >> NumPy-Discussion at python.org >> https://mail.python.org/mailman/listinfo/numpy-discussion >> >> > -------------- next part -------------- An HTML attachment was scrubbed... URL: From wieser.eric+numpy at gmail.com Sat Jun 30 18:08:40 2018 From: wieser.eric+numpy at gmail.com (Eric Wieser) Date: Sat, 30 Jun 2018 15:08:40 -0700 Subject: [Numpy-discussion] Polynomial evaluation inconsistencies In-Reply-To: References: Message-ID: Good catch, it would do that On Sat, 30 Jun 2018 at 15:07 Maxwell Aifer wrote: > *shouldn't the constructor call the lambda with Polynomial([0,1[) > > On Sat, Jun 30, 2018 at 6:05 PM, Maxwell Aifer > wrote: > >> Oh, clever... yeah I think that would be very cool. But shouldn't it call >> the constructor with Polynomial([0,1])? >> >> On Sat, Jun 30, 2018 at 5:41 PM, Eric Wieser > > wrote: >> >>> Since the one of the arguments for the decreasing order seems to just be >>> textual representation - do we want to tweak the repr to something like >>> >>> Polynomial(lambda x: 2*x**3 + 3*x**2 + x + 0) >>> >>> (And add a constructor that calls the lambda with Polynomial(1)) >>> >>> Eric >>> ? >>> >>> On Sat, 30 Jun 2018 at 14:30 Eric Wieser >>> wrote: >>> >>>> ?the intuitive way? is the decreasing powers. >>>> >>>> An argument against this is that accessing the ith power of x is spelt: >>>> >>>> - x.coeffs[i] for increasing powers >>>> - x.coeffs[-i-1] for decreasing powers >>>> >>>> The former is far more natural than the latter, and avoids a potential >>>> off-by-one error >>>> >>>> If I ask someone to write down the coefficients of a polynomial I don?t >>>> think anyone would start from c[2] >>>> >>>> You wouldn?t? I?d expect to see >>>> >>>> [image: f(x) = a_3x^3 + a_2x^2 + a_1x + a_0] >>>> >>>> rather than >>>> >>>> [image: f(x) = a_0x^3 + a_1x^2 + a_2x + a_3] >>>> >>>> Sure, I?d write it starting with the highest power, but I?d still >>>> number my coefficients to match the powers. >>>> >>>> >>>> Eric >>>> ? >>>> >>> >>> _______________________________________________ >>> NumPy-Discussion mailing list >>> NumPy-Discussion at python.org >>> https://mail.python.org/mailman/listinfo/numpy-discussion >>> >>> >> > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion at python.org > https://mail.python.org/mailman/listinfo/numpy-discussion > -------------- next part -------------- An HTML attachment was scrubbed... 
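A minimal sketch of the constructor idea being discussed here. The from_callable name is invented for illustration; everything else is existing Polynomial arithmetic, and it shows why the callable has to be evaluated at the identity polynomial Polynomial([0, 1]) rather than at Polynomial(1):

    from numpy.polynomial import Polynomial

    def from_callable(f):
        # Evaluate the callable at p(x) = x, i.e. Polynomial([0, 1]);
        # ordinary Polynomial arithmetic then assembles the coefficients
        # in increasing order of degree.
        return f(Polynomial([0, 1]))

    p = from_callable(lambda x: 2*x**3 + 3*x**2 + x)
    # p.coef -> array([0., 1., 3., 2.])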
URL: From charlesr.harris at gmail.com Sat Jun 30 18:47:10 2018 From: charlesr.harris at gmail.com (Charles R Harris) Date: Sat, 30 Jun 2018 16:47:10 -0600 Subject: [Numpy-discussion] Polynomial evaluation inconsistencies In-Reply-To: References: Message-ID: On Sat, Jun 30, 2018 at 4:42 PM, Charles R Harris wrote: > > > On Sat, Jun 30, 2018 at 3:41 PM, Eric Wieser > wrote: > >> Since the one of the arguments for the decreasing order seems to just be >> textual representation - do we want to tweak the repr to something like >> >> Polynomial(lambda x: 2*x**3 + 3*x**2 + x + 0) >> >> (And add a constructor that calls the lambda with Polynomial(1)) >> >> Eric >> > > IIRC there was a proposal for that. There is the possibility of adding > renderers for latex and html that could be used by Jupyter, and I think the > ordering was an option. > See https://github.com/numpy/numpy/issues/8893 for the proposal. BTW, if someone would like to work on this, go for it. Chuck > ? >> > -------------- next part -------------- An HTML attachment was scrubbed... URL: From charlesr.harris at gmail.com Sat Jun 30 18:42:45 2018 From: charlesr.harris at gmail.com (Charles R Harris) Date: Sat, 30 Jun 2018 16:42:45 -0600 Subject: [Numpy-discussion] Polynomial evaluation inconsistencies In-Reply-To: References: Message-ID: On Sat, Jun 30, 2018 at 3:41 PM, Eric Wieser wrote: > Since the one of the arguments for the decreasing order seems to just be > textual representation - do we want to tweak the repr to something like > > Polynomial(lambda x: 2*x**3 + 3*x**2 + x + 0) > > (And add a constructor that calls the lambda with Polynomial(1)) > > Eric > IIRC there was a proposal for that. There is the possibility of adding renderers for latex and html that could be used by Jupyter, and I think the ordering was an option. Chuck > ? > -------------- next part -------------- An HTML attachment was scrubbed... URL: From robert.kern at gmail.com Sat Jun 30 22:23:48 2018 From: robert.kern at gmail.com (Robert Kern) Date: Sat, 30 Jun 2018 19:23:48 -0700 Subject: [Numpy-discussion] Revised NEP-18, __array_function__ protocol In-Reply-To: References: Message-ID: On Sat, Jun 30, 2018 at 12:14 PM Stephan Hoyer wrote: > I?d love to see a generic way of doing random number generation, but I > agree with Martin that I don?t see it fitting a naturally into this NEP. An > invasive change to add an array_reference argument to a bunch of functions > might indeed be worthy of its own NEP, but again I?m not convinced that?s > actually the right approach. I?d rather add a few new functions like > random_like, which is a small enough change that concensus on the list > might be enough. > random_like() seems very weird to me. It doesn't seem like a function that anyone actually wants. It seems like what people actually want is to be able to draw random numbers from any distribution as a specified array-like type and shape, not just sample U(0, 1) with the shape of an existing array. The most workable way to do this is to modify RandomGenerator (i.e. the new RandomState design)[1] to accept the array-like type in the class constructor, and modify its internals to do the right thing. Because the intrusion on the API is so small, that doesn't require a NEP, just a PR (a long, complicated, and tedious PR, to be sure)[2]. There are a bunch of technical issues (if you want to avoid memory copies) because the Cython implementation requires direct memory access, but that's intrinsic to any solution to this problem, regardless of the API choices. 
random_like() would have the same issues. [1] https://github.com/bashtage/randomgen [2] Sorry, Kevin. -- Robert Kern -------------- next part -------------- An HTML attachment was scrubbed... URL: