[OT] Starving CPUs article featured in IEEE's ComputingNow portal
Hi,

Konrad Hinsen has just told me that my article "Why Modern CPUs Are Starving and What Can Be Done About It", which has just been released in the March/April issue of "Computing in Science and Engineering", has also made it into this month's free-access selection on IEEE's ComputingNow portal:

http://www.computer.org/portal/web/computingnow
http://www.computer.org/portal/web/computingnow/0310/whatsnew/cise

In it, I discuss one of my favourite subjects, memory access, and why it matters for today's computing. There are also recommendations for people wanting to squeeze the maximum performance out of their computers. And, although I tried to be as language-agnostic as I could, some Python references can be seen here and there :-).

Well, sorry about this semi-OT post, but I could not resist :-)

-- Francesc Alted
On 18 March 2010 09:57, Francesc Alted <faltet@pytables.org> wrote:
Hi,
Konrad Hinsen has just told me that my article "Why Modern CPUs Are Starving and What Can Be Done About It", which has just been released in the March/April issue of "Computing in Science and Engineering", has also made it into this month's free-access selection on IEEE's ComputingNow portal:
Speak for your own CPUs :).

But seriously, congratulations on the wide publication of the article; it's an important issue we often don't think enough about. I'm just a little snarky because this exact issue came up for us recently - a visiting astro speaker put it as "flops are free" - and so I did some tests and found that even without optimizing for memory access, our tasks are already CPU-bound: http://lighthouseinthesky.blogspot.com/2010/03/flops.html

In terms of specifics, I was a little surprised you didn't mention FFTW among your software tools that optimize memory access. FFTW's planning scheme seems ideal for ensuring memory locality, as much as possible, during large FFTs. (And in fact I also found that for really large FFTs, reducing padding - memory size - at the cost of a non-power-of-two size was also worth it.)
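The test was essentially a comparison along these lines (just a toy version to show the shape of it, not the code behind the blog post; the array size is illustrative):

    import time
    import numpy as np

    n = 2**22                          # ~4 million complex samples (illustrative)
    a = np.random.randn(n) + 1j * np.random.randn(n)

    t0 = time.time()
    b = 2.0 * a + 1.0                  # streams the array once: memory-bandwidth bound
    t1 = time.time()
    c = np.fft.fft(a)                  # O(n log n) flops on the same data: CPU bound
    t2 = time.time()

    print("elementwise: %.3f s" % (t1 - t0))
    print("fft:         %.3f s" % (t2 - t1))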
http://www.computer.org/portal/web/computingnow http://www.computer.org/portal/web/computingnow/0310/whatsnew/cise
On it, I discuss one of my favourite subjects, memory access, and why it is important for nowadays computing. There are also recommendations for people wanting to squeeze the maximum of performance out of their computers. And, although I tried to be as language-agnostic as I could, there can be seen some Python references here and there :-).
Heh. Indeed numexpr is a good tool for this sort of thing; it's an unfortunate fact that simple use of numpy tends to do operations in the pessimal order...

Anne
Well, sorry about this semi-OT but I could not resist :-)
-- Francesc Alted
On Thursday 18 March 2010 16:26:09, Anne Archibald wrote:
Speak for your own CPUs :).
But seriously, congratulations on the wide publication of the article; it's an important issue we often don't think enough about. I'm just a little snarky because this exact issue came up for us recently - a visiting astro speaker put it as "flops are free" - and so I did some tests and found that even without optimizing for memory access, our tasks are already CPU-bound: http://lighthouseinthesky.blogspot.com/2010/03/flops.html
Well, I thought that my introduction was enough to convince anybody about the problem, but I forgot that you scientists always try to demonstrate things experimentally :-/

Seriously, your example is a clear illustration of what I'm recommending in the article, i.e. always try to use libraries that already leverage the blocking technique (that is, taking advantage of both temporal and spatial locality). I don't know FFTW (never used it, sorry), but after having a look at its home page, I'm pretty convinced that its authors are very conscious of these techniques.

That being said, it seems that, in addition, you are applying the blocking technique yourself as well: get the data in bunches (256 floating-point elements, which fit perfectly well in modern L1 caches), apply your computation (in this case, FFTW), and put the result back in memory. A perfect example of what I wanted to show to the readers, so congratulations! You managed it without needing to read my article (so perhaps the article was not so necessary after all :-).
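In pseudo-code, the pattern I have in mind is roughly the following (the chunk size and the per-chunk computation here are only illustrative, not Anne's actual pipeline):

    import numpy as np

    def blocked_apply(x, func, chunk=256):
        # Process x in chunks small enough to stay resident in L1 cache,
        # so each element is loaded from main memory only once.
        out = np.empty_like(x)
        for start in range(0, x.size, chunk):
            stop = min(start + chunk, x.size)
            out[start:stop] = func(x[start:stop])
        return out

    x = np.random.randn(1000000)
    y = blocked_apply(x, lambda block: np.sqrt(block * block + 1.0))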
In terms of specifics, I was a little surprised you didn't mention FFTW among your software tools that optimize memory access. FFTW's planning scheme seems ideal for ensuring memory locality, as much as possible, during large FFTs. (And in fact I also found that for really large FFTs, reducing padding - memory size - at the cost of a non-power-of-two size was also worth it.)
I must say that I'm quite naïve about many of the great existing tools for scientific computing. What I do know is that when I need to do something, I always look for good existing tools first. That is basically why I spoke about numexpr and BLAS/LAPACK: I know them well.
Heh. Indeed numexpr is a good tool for this sort of thing; it's an unfortunate fact that simple use of numpy tends to do operations in the pessimal order...
Well, to be fair, NumPy has no control over the order of operations in expressions or over how temporaries are managed: it is Python that decides that. NumPy can only do what Python asks of it, and do it as well as possible. NumPy plays its role reasonably well here, but of course that is not enough to deliver performance. In fact, this problem probably affects all interpreted languages out there, unless they implement a JIT compiler optimised for evaluating expressions -- and that is basically what numexpr is.

Anyway, thanks for the constructive criticism, I really appreciate it!

-- Francesc Alted
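P.S. To make the numexpr point concrete, a toy comparison (just a sketch; ne.evaluate() compiles the whole expression and evaluates it blockwise, avoiding the full-size temporaries the plain NumPy version creates):

    import numpy as np
    import numexpr as ne

    a = np.random.randn(10000000)
    b = np.random.randn(10000000)

    r1 = 2.0 * a + 3.0 * b**2                  # Python drives this: several full-size temporaries
    r2 = ne.evaluate("2.0 * a + 3.0 * b**2")   # one blockwise pass over a and b

    assert np.allclose(r1, r2)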
On 18 March 2010 13:53, Francesc Alted <faltet@pytables.org> wrote:
On Thursday 18 March 2010 16:26:09, Anne Archibald wrote:
Speak for your own CPUs :).
But seriously, congratulations on the wide publication of the article; it's an important issue we often don't think enough about. I'm just a little snarky because this exact issue came up for us recently - a visiting astro speaker put it as "flops are free" - and so I did some tests and found that even without optimizing for memory access, our tasks are already CPU-bound: http://lighthouseinthesky.blogspot.com/2010/03/flops.html
Well, I thought that my introduction was enough to convince anybody about the problem, but I forgot that you scientists always try to demonstrate things experimentally :-/
Snrk. Well, technically, that is our job description...
Seriously, your example is a clear illustration of what I'm recommending in the article, i.e. always try to use libraries that already leverage the blocking technique (that is, taking advantage of both temporal and spatial locality). I don't know FFTW (never used it, sorry), but after having a look at its home page, I'm pretty convinced that its authors are very conscious of these techniques.
That being said, it seems that, in addition, you are applying the blocking technique yourself as well: get the data in bunches (256 floating-point elements, which fit perfectly well in modern L1 caches), apply your computation (in this case, FFTW), and put the result back in memory. A perfect example of what I wanted to show to the readers, so congratulations! You managed it without needing to read my article (so perhaps the article was not so necessary after all :-).
What I didn't go into in detail in the article was that there's a trade-off of processing versus memory access available: we could reduce the memory load by a factor of eight by doing interpolation on the fly instead of all at once in a giant FFT. But that would cost cache space and flops, and we're not memory-dominated.

One thing I didn't try, and should: running four of these jobs at once on a four-core machine. If I correctly understand the architecture, that won't affect the cache issues, but it will effectively quadruple the memory bandwidth needed, without increasing the memory bandwidth available. (Which, honestly, makes me wonder what the point is of building multicore machines.)

Maybe I should look into that interpolation stuff.
Heh. Indeed numexpr is a good tool for this sort of thing; it's an unfortunate fact that simple use of numpy tends to do operations in the pessimal order...
Well, to be fair, NumPy has no control over the order of operations in expressions or over how temporaries are managed: it is Python that decides that. NumPy can only do what Python asks of it, and do it as well as possible. NumPy plays its role reasonably well here, but of course that is not enough to deliver performance. In fact, this problem probably affects all interpreted languages out there, unless they implement a JIT compiler optimised for evaluating expressions -- and that is basically what numexpr is.
I'm not knocking numpy; it does (almost) the best it can. (I'm not sure of the optimality of the order in which ufuncs are executed; I think some optimizations there are possible.) But a language designed from scratch for vector calculations could certainly compile expressions into a form that would save a lot of memory accesses, particularly if an optimizer combined many lines of code.

I've actually thought about whether such a thing could be done in python; I think the way to do it would be to build expression objects from variable objects, then have a single "apply" function that fed values in to all the variables. The same framework would support automatic differentiation and other cool things, but I'm not sure it would be useful enough to be worth the implementation complexity.

Anne
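P.S. In very rough sketch form, the expression-object idea I mean would look something like this (a toy, not working library code):

    import operator
    import numpy as np

    class Expr(object):
        def __add__(self, other): return Op(operator.add, self, other)
        def __mul__(self, other): return Op(operator.mul, self, other)

    class Var(Expr):
        def __init__(self, name): self.name = name
        def apply(self, env): return env[self.name]

    class Op(Expr):
        def __init__(self, fn, a, b): self.fn, self.a, self.b = fn, a, b
        def apply(self, env):
            # A real implementation would compile/fuse the graph here
            # instead of naively recursing and allocating temporaries.
            return self.fn(self.a.apply(env), self.b.apply(env))

    x, y = Var("x"), Var("y")
    expr = x * y + x                   # builds an expression graph, computes nothing yet
    print(expr.apply({"x": np.arange(5.0), "y": np.ones(5)}))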
Anne Archibald wrote:
(Which, honestly, makes me wonder what the point is of building multicore machines.)
Advertising... Oh, and having multiple cores slows down multi-threading in Python, so that feature is worth the expense!
But a language designed from scratch for vector calculations
Ever hear of ZPL?

http://www.cs.washington.edu/research/zpl/home/index.html

I think it's pretty much a dead project, but I always liked the idea.

-Chris

--
Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959  voice
7600 Sand Point Way NE   (206) 526-6329  fax
Seattle, WA 98115        (206) 526-6317  main reception

Chris.Barker@noaa.gov
On 19-Mar-10, at 1:13 PM, Anne Archibald wrote:
I'm not knocking numpy; it does (almost) the best it can. (I'm not sure of the optimality of the order in which ufuncs are executed; I think some optimizations there are possible.) But a language designed from scratch for vector calculations could certainly compile expressions into a form that would save a lot of memory accesses, particularly if an optimizer combined many lines of code. I've actually thought about whether such a thing could be done in python; I think the way to do it would be to build expression objects from variable objects, then have a single "apply" function that fed values in to all the variables.
Hey Anne,

Some folks across town from you at U de M have built just such a thing. :)

http://deeplearning.net/software/theano/

It does all that, plus automatic differentiation, detection and correction of numerical instabilities, etc.

Probably the most amazing thing about it is that with recent versions, you basically flip a switch and it will use an available CUDA-capable Nvidia GPU instead of the CPU. I'll admit, when James Bergstra initially told me about this plan to make it possible to transparently switch to running stuff on the GPU, I thought it was so ambitious that it would never happen. Then it did...

David
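P.S. A minimal example of what using it looks like (written from memory, so treat the exact calls as approximate):

    import theano
    import theano.tensor as T

    x = T.dvector('x')
    y = T.sum(x ** 2)                  # symbolic graph; nothing computed yet
    g = T.grad(y, x)                   # automatic differentiation on the graph
    f = theano.function([x], [y, g])   # compiled function; can also target a GPU

    print(f([1.0, 2.0, 3.0]))          # -> 14.0 and [2., 4., 6.]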
On Friday 19 March 2010 18:13:33, Anne Archibald wrote: [clip]
What I didn't go into in detail in the article was that there's a trade-off of processing versus memory access available: we could reduce the memory load by a factor of eight by doing interpolation on the fly instead of all at once in a giant FFT. But that would cost cache space and flops, and we're not memory-dominated.
One thing I didn't try, and should: running four of these jobs at once on a four-core machine. If I correctly understand the architecture, that won't affect the cache issues, but it will effectively quadruple the memory bandwidth needed, without increasing the memory bandwidth available. (Which, honestly, makes me wonder what the point is of building multicore machines.)
Maybe I should look into that interpolation stuff.
Please do. Although you may be increasing the data rate by 4x, your program is already very efficient in how it handles data, so chances are that you will still get a good speed-up. I'd be glad to hear back about your experience.

-- Francesc Alted
On 20 March 2010 06:32, Francesc Alted <faltet@pytables.org> wrote:
On Friday 19 March 2010 18:13:33, Anne Archibald wrote: [clip]
What I didn't go into in detail in the article was that there's a trade-off of processing versus memory access available: we could reduce the memory load by a factor of eight by doing interpolation on the fly instead of all at once in a giant FFT. But that would cost cache space and flops, and we're not memory-dominated.
One thing I didn't try, and should: running four of these jobs at once on a four-core machine. If I correctly understand the architecture, that won't affect the cache issues, but it will effectively quadruple the memory bandwidth needed, without increasing the memory bandwidth available. (Which, honestly, makes me wonder what the point is of building multicore machines.)
Maybe I should look into that interpolation stuff.
Please do. Although you may be increasing the data rate by 4x, your program is already very efficient in how it handles data, so chances are that you will still get a good speed-up. I'd be glad to hear back about your experience.
The thing is, it reduces the data rate from memory, but at the cost of additional FFTs (to implement convolutions). If my program is already spending all its time doing FFTs, and the loads from memory are happening while the CPU does FFTs, then there's no win in runtime from reducing the memory load, and there's a runtime cost of doing those convolutions - not just more flops but also more cache pressure (to store the interpolated array and the convolution kernels). One could go a further step and do interpolation directly, without convolution, but that adds really a lot of flops, which translates directly to runtime.

On the other hand, if it doesn't completely blow out the cache, we do have non-interpolated FFTs already on disk (with red noise adjustments already applied), so we might save on the relatively minor cost of the giant FFT. I'll have to do some time trials.

Anne
Anne Archibald wrote:
I'm not knocking numpy; it does (almost) the best it can. (I'm not sure of the optimality of the order in which ufuncs are executed; I think some optimizations there are possible.)
Ufuncs and reductions are not performed in a cache-optimal fashion; IIRC, dimensions are always traversed in order from left to right. Large speedups are possible in some cases, but in a quick try I didn't manage to come up with an algorithm that would always improve the speed (there was a thread about this last year or so, and there's a ticket). Things varied between computers, so this probably depends a lot on the actual cache arrangement.

But perhaps numexpr has such heuristics, and we could steal them?

-- Pauli Virtanen
Pauli Virtanen wrote:
Anne Archibald wrote:
I'm not knocking numpy; it does (almost) the best it can. (I'm not sure of the optimality of the order in which ufuncs are executed; I think some optimizations there are possible.)
Ufuncs and reductions are not performed in a cache-optimal fashion, IIRC dimensions are always traversed in order from left to right. Large speedups are possible in some cases, but in a quick try I didn't manage to come up with an algorithm that would always improve the speed (there was a thread about this last year or so, and there's a ticket). Things varied between computers, so this probably depends a lot on the actual cache arrangement.
But perhaps numexpr has such heuristics, and we could steal them?
At least in MultiIter (and I always assumed ufuncs too, but perhaps not) there's functionality to remove the largest dimension so that it can be put innermost in a loop. In many situations, removing the dimension with the smallest stride from the iterator would probably work much better.

It's all about balancing iterator overhead and memory overhead. Something simple like "select the dimension with length > 200 which has smallest stride, or the dimension with largest length if none are above 200" would perhaps work well?

Dag Sverre
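P.S. In code, the heuristic I mean is roughly this (the 200-element threshold is just the number quoted above, not anything tuned):

    def pick_inner_dim(shape, strides, threshold=200):
        # Choose the axis to iterate over innermost in a ufunc-style loop.
        long_axes = [i for i, n in enumerate(shape) if n > threshold]
        if long_axes:
            # prefer the long axis with the smallest stride (best spatial locality)
            return min(long_axes, key=lambda i: abs(strides[i]))
        # otherwise take the longest axis, to amortize per-iteration overhead
        return max(range(len(shape)), key=lambda i: shape[i])

    print(pick_inner_dim((1000, 3), (8, 8000)))   # -> 0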
On 20 March 2010 14:56, Dag Sverre Seljebotn <dagss@student.matnat.uio.no> wrote:
Pauli Virtanen wrote:
Anne Archibald wrote:
I'm not knocking numpy; it does (almost) the best it can. (I'm not sure of the optimality of the order in which ufuncs are executed; I think some optimizations there are possible.)
Ufuncs and reductions are not performed in a cache-optimal fashion, IIRC dimensions are always traversed in order from left to right. Large speedups are possible in some cases, but in a quick try I didn't manage to come up with an algorithm that would always improve the speed (there was a thread about this last year or so, and there's a ticket). Things varied between computers, so this probably depends a lot on the actual cache arrangement.
But perhaps numexpr has such heuristics, and we could steal them?
At least in MultiIter (and I always assumed ufuncs too, but perhaps not) there's functionality to remove the largest dimension so that it can be put innermost in a loop. In many situations, removing the dimension with the smallest stride from the iterator would probably work much better.
It's all about balancing iterator overhead and memory overhead. Something simple like "select the dimension with length > 200 which has smallest stride, or the dimension with largest length if none are above 200" would perhaps work well?
There's more to it than that: there's no point (I think) going with the smallest stride if that stride is more than a cache line (usually 64 bytes). There are further optimizations - often stride[i]*shape[i]==stride[j] for some (i,j), and in that case these two dimensions can be treated as one with stride stride[i] and length shape[i]*shape[j]. Applying this would "unroll" all C- or Fortran-contiguous arrays to a single non-nested loop.

Of course, reordering ufuncs is potentially hazardous in terms of breaking user code: the effect of A[1:]*=A[:-1] depends on whether you run through A in index order or reverse index order (which might be the order it's stored in memory, potentially faster from a cache point of view).

Anne
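P.S. The dimension-collapsing step, in sketch form (C order assumed; illustrative only):

    def collapse_dims(shape, strides):
        # Fuse an outer axis into its inner neighbour whenever
        # stride[outer] == stride[inner] * shape[inner].
        out_shape, out_strides = [shape[0]], [strides[0]]
        for n, s in zip(shape[1:], strides[1:]):
            if out_strides[-1] == s * n:
                out_shape[-1] *= n
                out_strides[-1] = s
            else:
                out_shape.append(n)
                out_strides.append(s)
        return tuple(out_shape), tuple(out_strides)

    # A C-contiguous (4, 5, 6) float64 array has strides (240, 48, 8);
    # all three axes fuse into one length-120 axis with stride 8.
    print(collapse_dims((4, 5, 6), (240, 48, 8)))   # -> ((120,), (8,))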
On Sat, Mar 20, 2010 at 8:22 PM, Anne Archibald <peridot.faceted@gmail.com> wrote:
On 20 March 2010 14:56, Dag Sverre Seljebotn <dagss@student.matnat.uio.no> wrote:
Pauli Virtanen wrote:
Anne Archibald wrote:
I'm not knocking numpy; it does (almost) the best it can. (I'm not sure of the optimality of the order in which ufuncs are executed; I think some optimizations there are possible.)
Ufuncs and reductions are not performed in a cache-optimal fashion, IIRC dimensions are always traversed in order from left to right. Large speedups are possible in some cases, but in a quick try I didn't manage to come up with an algorithm that would always improve the speed (there was a thread about this last year or so, and there's a ticket). Things varied between computers, so this probably depends a lot on the actual cache arrangement.
But perhaps numexpr has such heuristics, and we could steal them?
At least in MultiIter (and I always assumed ufuncs too, but perhaps not) there's functionality to remove the largest dimension so that it can be put innermost in a loop. In many situations, removing the dimension with the smallest stride from the iterator would probably work much better.
It's all about balancing iterator overhead and memory overhead. Something simple like "select the dimension with length > 200 which has smallest stride, or the dimension with largest length if none are above 200" would perhaps work well?
There's more to it than that: there's no point (I think) going with the smallest stride if that stride is more than a cache line (usually 64 bytes). There are further optimizations - often stride[i]*shape[i]==stride[j] for some (i,j), and in that case these two dimensions can be treated as one with stride stride[i] and length shape[i]*shape[j]. Applying this would "unroll" all C- or Fortran-contiguous arrays to a single non-nested loop.
Of course, reordering ufuncs is potentially hazardous in terms of breaking user code: the effect of A[1:]*=A[:-1] depends on whether you run through A in index order or reverse index order (which might be the order it's stored in memory, potentially faster from a cache point of view).
I think Travis O suggested at some point to make these kinds of "overlapping in-place" operations invalid (maybe it was someone else ...). The idea was to protect new ("uninitiated") users from unexpected results. Anyway, maybe one could use (something like) np.seterr to determine how to handle these cases...

-Sebastian
On 20 March 2010 16:18, Sebastian Haase <seb.haase@gmail.com> wrote:
On Sat, Mar 20, 2010 at 8:22 PM, Anne Archibald <peridot.faceted@gmail.com> wrote:
On 20 March 2010 14:56, Dag Sverre Seljebotn <dagss@student.matnat.uio.no> wrote:
Pauli Virtanen wrote:
Anne Archibald wrote:
I'm not knocking numpy; it does (almost) the best it can. (I'm not sure of the optimality of the order in which ufuncs are executed; I think some optimizations there are possible.)
Ufuncs and reductions are not performed in a cache-optimal fashion, IIRC dimensions are always traversed in order from left to right. Large speedups are possible in some cases, but in a quick try I didn't manage to come up with an algorithm that would always improve the speed (there was a thread about this last year or so, and there's a ticket). Things varied between computers, so this probably depends a lot on the actual cache arrangement.
But perhaps numexpr has such heuristics, and we could steal them?
At least in MultiIter (and I always assumed ufuncs too, but perhaps not) there's functionality to remove the largest dimension so that it can be put innermost in a loop. In many situations, removing the dimension with the smallest stride from the iterator would probably work much better.
It's all about balancing iterator overhead and memory overhead. Something simple like "select the dimension with length > 200 which has smallest stride, or the dimension with largest length if none are above 200" would perhaps work well?
There's more to it than that: there's no point (I think) going with the smallest stride if that stride is more than a cache line (usually 64 bytes). There are further optimizations - often stride[i]*shape[i]==stride[j] for some (i,j), and in that case these two dimensions can be treated as one with stride stride[i] and length shape[i]*shape[j]. Applying this would "unroll" all C- or Fortran-contiguous arrays to a single non-nested loop.
Of course, reordering ufuncs is potentially hazardous in terms of breaking user code: the effect of A[1:]*=A[:-1] depends on whether you run through A in index order or reverse index order (which might be the order it's stored in memory, potentially faster from a cache point of view).
I think Travis O suggested at some point to make these kinds of "overlapping in-place" operations invalid (maybe it was someone else ...). The idea was to protect new ("uninitiated") users from unexpected results. Anyway, maybe one could use (something like) np.seterr to determine how to handle these cases...
I was in on that discussion. My recollection of the conclusion was that on the one hand they're useful, carefully applied, while on the other hand they're very difficult to reliably detect (since you don't want to forbid operations on non-overlapping slices of the same array). Anne
On Fri, Mar 19, 2010 at 11:18 PM, David Warde-Farley <dwf@cs.toronto.edu> wrote:
On 19-Mar-10, at 1:13 PM, Anne Archibald wrote:
I'm not knocking numpy; it does (almost) the best it can. (I'm not sure of the optimality of the order in which ufuncs are executed; I think some optimizations there are possible.) But a language designed from scratch for vector calculations could certainly compile expressions into a form that would save a lot of memory accesses, particularly if an optimizer combined many lines of code. I've actually thought about whether such a thing could be done in python; I think the way to do it would be to build expression objects from variable objects, then have a single "apply" function that fed values in to all the variables.
Hey Anne,
Some folks across town from you at U de M have built just such a thing. :)
http://deeplearning.net/software/theano/
It does all that, plus automatic differentiation, detection and correction of numerical instabilities, etc.
Probably the most amazing thing about it is that with recent versions, you basically flip a switch and it will use an available CUDA-capable Nvidia GPU instead of the CPU. I'll admit, when James Bergstra initially told me about this plan to make it possible to transparently switch to running stuff on the GPU, I thought it was so ambitious that it would never happen. Then it did...
The progress Theano is making is promising. I have looked at Theano several times and I like the idea of code generation, especially the numpy support. I hope it may be useful for one of my projects in the future.

What I couldn't figure out from the documentation is the actual performance and ease of use. Am I right in assuming that you are not a Theano dev? Have you used Theano in a project? What are your experiences? Do you happen to know how big the computational graphs can be? Is it possible to have loops and if/then/else statements?

Sorry for being a little off-topic here.

Sebastian
On Sun, Mar 21, 2010 at 1:13 PM, Sebastian Walter <sebastian.walter@gmail.com> wrote:
On Fri, Mar 19, 2010 at 11:18 PM, David Warde-Farley <dwf@cs.toronto.edu> wrote:
On 19-Mar-10, at 1:13 PM, Anne Archibald wrote:
I'm not knocking numpy; it does (almost) the best it can. (I'm not sure of the optimality of the order in which ufuncs are executed; I think some optimizations there are possible.) But a language designed from scratch for vector calculations could certainly compile expressions into a form that would save a lot of memory accesses, particularly if an optimizer combined many lines of code. I've actually thought about whether such a thing could be done in python; I think the way to do it would be to build expression objects from variable objects, then have a single "apply" function that fed values in to all the variables.
Hey Anne,
Some folks across town from you at U de M have built just such a thing. :)
http://deeplearning.net/software/theano/
It does all that, plus automatic differentiation, detection and correction of numerical instabilities, etc.
Probably the most amazing thing about it is that with recent versions, you basically flip a switch and it will use an available CUDA-capable Nvidia GPU instead of the CPU. I'll admit, when James Bergstra initially told me about this plan to make it possible to transparently switch to running stuff on the GPU, I thought it was so ambitious that it would never happen. Then it did...
The progress Theano is making is promising. I have looked at Theano several times and I like the idea of code generation, especially the numpy support. I hope it may be useful for one of my projects in the future.
What I couldn't figure out from the documentation is the actual performance and ease of use. Am I right in assuming that you are not a Theano dev? Have you used Theano in a project? What are your experiences? Do you happen to know how big the computational graphs can be? Is it possible to have loops and if/then/else statements?
Sorry for being a little offtopic here.
I encourage you to sign up for theano-users@googlegroups.com if you want to keep an eye on things. If you forward this to theano-users, I'd be happy to answer it there.

James

--
http://www-etud.iro.umontreal.ca/~bergstrj
On Sat, 2010-03-20 at 17:36 -0400, Anne Archibald wrote:
I was in on that discussion. My recollection of the conclusion was that on the one hand they're useful, carefully applied, while on the other hand they're very difficult to reliably detect (since you don't want to forbid operations on non-overlapping slices of the same array).
I think one alternative brought up was "copy if unsure whether the slices overlap", which would make

    A[whatever] = A[whatever2]

always identical in functionality to

    A[whatever] = A[whatever2].copy()

which is how things should work. This would permit optimizing simple cases (at least 1D), and avoids running into NP-completeness (for numpy, the exponential growth is however limited by NPY_MAXDIMS, which is 64, IIRC).

This would be a change in semantics, but in a very obscure corner that hopefully nobody relies on.

-- Pauli Virtanen
On 22 March 2010 14:42, Pauli Virtanen <pav@iki.fi> wrote:
On Sat, 2010-03-20 at 17:36 -0400, Anne Archibald wrote:
I was in on that discussion. My recollection of the conclusion was that on the one hand they're useful, carefully applied, while on the other hand they're very difficult to reliably detect (since you don't want to forbid operations on non-overlapping slices of the same array).
I think one alternative brought up was
copy if unsure whether the slices overlap
which would make
A[whatever] = A[whatever2]
be always identical in functionality to
A[whatever] = A[whatever2].copy()
which is how things should work. This would permit optimizing simple cases (at least 1D), and avoids running into NP-completeness (for numpy, the exponential growth is however limited by NPY_MAXDIMS which is 64, IIRC).
It can produce surprise copies, but I could certainly live with this. Or maybe a slight modification: "always produces values equivalent to using a copy", to allow handling the common A[:-1]=A[1:] without a copy. Of course, we'd have to wait for someone to implement it...
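To spell out the A[:-1]=A[1:] case (a toy example; the .copy() form is the unambiguous one that the proposed semantics would match):

    import numpy as np

    A = np.arange(6)
    A[:-1] = A[1:].copy()    # unambiguous: shift left -> [1, 2, 3, 4, 5, 5]
    print(A)

    B = np.arange(6)
    B[:-1] = B[1:]           # overlapping slices; running forward through B
    print(B)                 # element by element gives the same values,
                             # which is why this case needs no actual copy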
This would be a change in semantics, but in a very obscure corner that hopefully nobody relies on.
It would certainly be nice to replace unspecified behaviour by specified behaviour if it can be done with minimal cost. And I think it could be, with some careful programming. Anne
participants (9)

- Anne Archibald
- Christopher Barker
- Dag Sverre Seljebotn
- David Warde-Farley
- Francesc Alted
- James Bergstra
- Pauli Virtanen
- Sebastian Haase
- Sebastian Walter