Mailman 3 array copy-to-self and views - NumPy-Discussion

array copy-to-self and views

Zachary Pincus

1 Feb 2007

8:28 a.m.

Hello folks,

I was recently trying to write code to modify an array in-place (so as not to invalidate any references to that array) via the standard python idiom for lists, e.g.:

    a[:] = numpy.flipud(a)

Now, flipud returns a view on 'a', so assigning that to 'a[:]' provides pretty strange results, as the buffer that is being read (the view) is simultaneously modified. Here is an example:

    In [2]: a = numpy.arange(10).reshape((5,2))

    In [3]: a
    Out[3]:
    array([[0, 1],
           [2, 3],
           [4, 5],
           [6, 7],
           [8, 9]])

    In [4]: numpy.flipud(a)
    Out[4]:
    array([[8, 9],
           [6, 7],
           [4, 5],
           [2, 3],
           [0, 1]])

    In [5]: a[:] = numpy.flipud(a)

    In [6]: a
    Out[6]:
    array([[8, 9],
           [6, 7],
           [4, 5],
           [6, 7],
           [8, 9]])

A question, then: Does this represent a bug? Or perhaps there is a better idiom for modifying an array in-place than 'a[:] = ...'? Or is it incumbent on the user to ensure that any time an array is directly modified, that the modifying array is not a view of the original array?

Thanks for any thoughts,

Zach Pincus

Program in Biomedical Informatics and Department of Biochemistry
Stanford University School of Medicine
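A minimal sketch of the in-place idiom with the overlap removed, assuming NumPy is imported as `np` (the alias and the name `b` are illustrative, not from the post). Copying the right-hand side first means the values being read can no longer be overwritten during the assignment, while `a` stays the same object:

```python
import numpy as np

a = np.arange(10).reshape((5, 2))
b = a  # a second reference to the same ndarray object

# Materialise the flipped rows in a fresh buffer before writing into `a`,
# so the source of the assignment cannot change while it is being read.
a[:] = np.flipud(a).copy()

print(a)
print(b is a)  # slice assignment fills `a`; it does not rebind the name
```

Newer NumPy releases may handle this particular overlap on their own; the explicit copy simply does not depend on that.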


Anne Archibald

1 Feb 2007

9:07 a.m.

On 01/02/07, Zachary Pincus wrote:

> I recently was trying to write code to modify an array in-place (so as not to invalidate any references to that array) via the standard python idiom for lists, e.g.:

You can do this, but if your concern is invalidating references, you might want to think about rearranging your application so you can just return "new" arrays (that may share elements), if that is possible.

>     a[:] = numpy.flipud(a)
>
> Now, flipud returns a view on 'a', so assigning that to 'a[:]' provides pretty strange results as the buffer that is being read (the view) is simultaneously modified. Here is an example:

...

> A question, then: Does this represent a bug? Or perhaps there is a better idiom for modifying an array in-place than 'a[:] = ...'? Or is it incumbent on the user to ensure that any time an array is directly modified, that the modifying array is not a view of the original array?

It's the user's job to keep them separate. Sorry. If you're worried - say if it's an array you don't have much control over (so it might share elements without you knowing), you can either return a new array, or if you must modify it in place, copy the right-hand side before using it (a[:]=flipud(a).copy()).

It would in principle be possible for numpy to provide a function that tells you if two arrays might share data (simply compare the pointer to the malloc()ed storage and ignore strides and offset; a bit conservative but probably Good Enough, though a bit more cleverness should let one get the Right Answer efficiently).

Anne M. Archibald
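Current NumPy provides checks along these lines: `np.may_share_memory` (a cheap, conservative bounds comparison) and `np.shares_memory` (an exact but potentially slower test). The sketch below, with illustrative variable names, shows where the two agree and where the conservative one over-reports:

```python
import numpy as np

a = np.arange(10)

flipped = a[::-1]              # reversed view covering all of `a`
front, back = a[:5], a[5:]     # two views on disjoint halves
evens, odds = a[::2], a[1::2]  # interleaved views with no element in common

# may_share_memory only compares the memory bounds of the two arrays.
print(np.may_share_memory(a, flipped))
print(np.may_share_memory(front, back))
print(np.may_share_memory(evens, odds))  # bounds overlap, elements do not

# shares_memory checks whether any element is actually shared.
print(np.shares_memory(evens, odds))
```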

Christopher Barker

5:38 p.m.

Zachary Pincus wrote:

> Hello folks,
>
> I recently was trying to write code to modify an array in-place (so as not to invalidate any references to that array)

I'm not sure what this means exactly.

> via the standard python idiom for lists, e.g.:
>
>     a[:] = numpy.flipud(a)
>
> Now, flipud returns a view on 'a', so assigning that to 'a[:]' provides pretty strange results as the buffer that is being read (the view) is simultaneously modified.

yes, weird. So why not just:

    a = numpy.flipud(a)

Since flipud returns a view, the new "a" will still be using the same data array. Does this satisfy your need above?

You've created a new numpy array object, but that was created by flipud anyway, so there is no performance loss.

It's too bad that to do this you need to know that flipud created a view, rather than a copy of the data, as if it were a copy, you would need to do the a[:] trick to make sure a kept the same data, but that's the price we pay for the flexibility and power of numpy -- the alternative is to have EVERYTHING create a copy, but there would be a substantial performance hit for that.

NOTE: the docstring doesn't make it clear that a view is created:

    >>> help(numpy.flipud)
    Help on function flipud in module numpy.lib.twodim_base:

    flipud(m)
        returns an array with the columns preserved and rows flipped in
        the up/down direction. Works on the first dimension of m.

NOTE2: Maybe these kinds of functions should have an optional flag that specified whether you want a view or a copy -- I'd have expected a copy in this case!

QUESTION: How do you tell if two arrays are views on the same data: is checking if they have the same .base reliable?

    >>> a = numpy.array((1,2,3,4))
    >>> b = a.view()
    >>> a.base is b.base
    False

No, I guess not. Maybe .base should return self if it's the originator of the data.

Is there a reliable way? I usually just test by changing a value in one to see if it changes in the other, but that's one heck of a kludge!

    >>> a.__array_interface__['data'][0] == b.__array_interface__['data'][0]
    True

seems to work, but that's pretty ugly!

-Chris

--
Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115       (206) 526-6317   main reception

Chris.Barker@noaa.gov
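A short sketch of why both checks above are fragile, assuming NumPy imported as `np`; the helper `data_address` and the array `c` are illustrative additions. An array that owns its data has `.base` set to `None`, so comparing `a.base` with `b.base` fails, and the data pointer only matches for views that start at the same element:

```python
import numpy as np

def data_address(arr):
    """Address of the first element of `arr` (illustrative helper)."""
    return arr.__array_interface__['data'][0]

a = np.array((1, 2, 3, 4))
b = a.view()
c = a[1:]  # a view that starts one element into the buffer

print(a.base is None)  # `a` owns its data, so it has no base
print(b.base is a)     # the view points back at the owning array
print(c.base is a)

print(data_address(a) == data_address(b))  # same first element
print(data_address(a) == data_address(c))  # offset view, different address
```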

Zachary Pincus

6:52 p.m.

> Zachary Pincus wrote:
>> Hello folks,
>>
>> I recently was trying to write code to modify an array in-place (so as not to invalidate any references to that array)
>
> I'm not sure what this means exactly.

Say one wants to keep two different variables referencing a single in-memory list, as so:

    a = [1,2,3]
    b = a

Now, if 'b' and 'a' go to live in different places (different class instances or whatever) but we want 'b' and 'a' to always refer to the same in-memory object, so that 'id(a) == id(b)', we need to make sure to not assign a brand new list to either one.

That is, if we do something like 'a = [i + 1 for i in a]' then 'id(a) != id(b)'. However, we can do 'a[:] = [i + 1 for i in a]' to modify a in-place.

This is not super-common, but it's also not an uncommon python idiom. I was in my email simply pointing out that naïvely translating that idiom to the numpy case can cause unexpected behavior in the case of views.

I think that this is unquestionably a bug -- isn't the point of views that the user shouldn't need to care if a particular array object is a view or not? Given the lack of methods to query whether an array is a view, or what it might be a view on, this seems like a reasonable perspective... I mean, if certain operations produce completely different results when one of the operands is a view, that *seems* like a bug. It might not be worth fixing, but I can't see how that behavior would be considered a feature.

However, I do think there's a legitimate question about whether it would be worth fixing -- there could be a lot of complicated checks to catch these kinds of corner cases.
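The list idiom described above can be sketched as follows (the name `rebound` is illustrative). It is safe for lists because the comprehension builds a complete new list before the slice assignment starts writing:

```python
a = [1, 2, 3]
b = a  # both names refer to one list object

rebound = [i + 1 for i in a]  # a brand new list, unrelated to `b`
print(rebound is b)

a[:] = [i + 1 for i in a]  # slice assignment refills the existing list
print(a is b, b)
```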

>> via the standard python idiom for lists, e.g.:
>>
>>     a[:] = numpy.flipud(a)
>>
>> Now, flipud returns a view on 'a', so assigning that to 'a[:]' provides pretty strange results as the buffer that is being read (the view) is simultaneously modified.
>
> yes, weird. So why not just:
>
>     a = numpy.flipud(a)
>
> Since flipud returns a view, the new "a" will still be using the same data array. Does this satisfy your need above?

Nope -- though 'a' and 'numpy.flipud(a)' share the same data, the actual ndarray instances are different. This means that any other references to the 'a' array (made via 'b = a' or whatever) now refer to the old 'a', not the flipped one. The only other option for sharing arrays is to encapsulate them as attributes of *another* object, which itself won't change. That seems a bit clumsy.
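The encapsulation option mentioned above might look like this sketch; `ArrayHolder` and its `data` attribute are hypothetical names, and NumPy is assumed to be imported as `np`. Everyone shares the holder, so rebinding its attribute is visible through every reference:

```python
import numpy as np

class ArrayHolder:
    """Hypothetical wrapper: share the holder rather than the array."""

    def __init__(self, data):
        self.data = data

holder = ArrayHolder(np.arange(10).reshape((5, 2)))
other = holder  # a second reference, e.g. stored in another object

# Rebinding the attribute needs no in-place write, so no overlap can occur.
holder.data = np.flipud(holder.data)

print(other.data[0])
```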

> It's too bad that to do this you need to know that flipud created a view, rather than a copy of the data, as if it were a copy, you would need to do the a[:] trick to make sure a kept the same data, but that's the price we pay for the flexibility and power of numpy -- the alternative is to have EVERYTHING create a copy, but there would be a substantial performance hit for that.

Well, Anne's email suggests another alternative -- each time a view is created, keep track of the original array from whence it came, and then only make a copy when collisions like the above would take place. And actually, I suspect that views already need to keep a reference to their original array in order to keep that array from being deleted before the view is. But I don't know the guts of numpy well enough to say for sure.

> NOTE: the docstring doesn't make it clear that a view is created:
>
>     >>> help(numpy.flipud)
>     Help on function flipud in module numpy.lib.twodim_base:
>
>     flipud(m)
>         returns an array with the columns preserved and rows flipped in
>         the up/down direction. Works on the first dimension of m.
>
> NOTE2: Maybe these kinds of functions should have an optional flag that specified whether you want a view or a copy -- I'd have expected a copy in this case!

Well, it seems like in most cases one does not need to care whether one is looking at a view or an array. The only time that comes to mind is when you're attempting to modify the array in-place, e.g.

    a[<something>] = <something else>

Even if the maybe-bug above were easily fixable (again, not sure about that), you might *still* want to be able to figure out if a were a view before such a modification. Whether this needs a runtime 'is_view' method, or just consistent documentation about what returns a view, isn't clear to me. Certainly the latter couldn't hurt.

> QUESTION: How do you tell if two arrays are views on the same data: is checking if they have the same .base reliable?
>
>     >>> a = numpy.array((1,2,3,4))
>     >>> b = a.view()
>     >>> a.base is b.base
>     False
>
> No, I guess not. Maybe .base should return self if it's the originator of the data.
>
> Is there a reliable way? I usually just test by changing a value in one to see if it changes in the other, but that's one heck of a kludge!
>
>     >>> a.__array_interface__['data'][0] == b.__array_interface__['data'][0]
>     True
>
> seems to work, but that's pretty ugly!

Good question. As I mentioned above, I assume that this information is tracked internally to prevent the 'original' array data from being deleted before any views have; however I really don't know how it is exposed.

Zach

Timothy Hochberg

7:24 p.m.

On 2/1/07, Zachary Pincus wrote:

[CHOP]

> I think that this is unquestionably a bug

It's not a bug. It's a design decision. It has certain consequences. Many good, some bad and some that just take some getting used to.

> -- isn't the point of views that the user shouldn't need to care if a particular array object is a view or not?

As you state elsewhere, the issue isn't whether a given object is a view per se, it's whether the objects that you are operating on refer to the same block of memory. They could both be views, even of the same object, and as long as they're disjoint, it's not a problem.

> Given the lack of methods to query whether an array is a view, or what it might be a view on, this seems like a reasonable perspective... I mean, if certain operations produce completely different results when one of the operands is a view, that *seems* like a bug. It might not be worth fixing, but I can't see how that behavior would be considered a feature.

View semantics are a feature. A powerful and sometimes dangerous feature. Sometimes the consequences of these semantics can bite people, but that doesn't make them a bug.

[CHOP]

> Good question. As I mentioned above, I assume that this information is tracked internally to prevent the 'original' array data from being deleted before any views have; however I really don't know how it is exposed.

I believe that a reference is held to the original array, so the array itself won't be deleted even if all of the references to it go away. The details may be different, but that's the gist of it. Even if you could access this, it wouldn't really tell you anything useful since two slices could refer to pieces of the original chunk of data, yet still be disjoint. If you wanted to be able to figure this out, probably the thing to do is just to actually look at the block of data occupied by each array and see if they overlap. I think you could even do this without resorting to C by using the array interface.

However, I'd like to repeat what my doctor said when I was a kid and complained that "it hurts when I do this":

    "Don't do that!" -- Some Random Doctor

In other words, I think you'd be better off restructuring your code so that this isn't an issue. I've been using Numeric/numarray/numpy for over ten years now and this has never been a significant issue for me.

--
//=][=\\

tim.hochberg@ieee.org
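A rough sketch of the "look at the block of data occupied by each array" idea, done from Python through the array interface. The helpers `byte_extent` and `extents_overlap` are hypothetical names, and the check is conservative: it compares address ranges only, so interleaved views that share no element still count as overlapping.

```python
import numpy as np

def byte_extent(arr):
    """Return (low, high) byte addresses spanned by `arr` (hypothetical helper)."""
    start = arr.__array_interface__['data'][0]
    low = high = start
    for n, stride in zip(arr.shape, arr.strides):
        if n == 0:
            return start, start  # an empty array occupies no bytes
        offset = (n - 1) * stride
        if offset < 0:
            low += offset   # negative strides extend the range downwards
        else:
            high += offset
    return low, high + arr.itemsize

def extents_overlap(x, y):
    """True if the address ranges of `x` and `y` intersect."""
    x_low, x_high = byte_extent(x)
    y_low, y_high = byte_extent(y)
    return x_low < y_high and y_low < x_high

a = np.arange(10)
print(extents_overlap(a[:5], a[5:]))     # disjoint halves
print(extents_overlap(a, a[::-1]))       # reversed view of the same data
print(extents_overlap(a[::2], a[1::2]))  # interleaved: ranges overlap anyway
```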

Christopher Barker

8:12 p.m.

Zachary Pincus wrote:

>>> I recently was trying to write code to modify an array in-place (so as not to invalidate any references to that array)
>>
>> I'm not sure what this means exactly.
>
> Say one wants to keep two different variables referencing a single in-memory list, as so:
>
>     a = [1,2,3]
>     b = a
>
> Now, if 'b' and 'a' go to live in different places (different class instances or whatever) but we want 'b' and 'a' to always refer to the same in-memory object, so that 'id(a) == id(b)', we need to make sure to not assign a brand new list to either one.

OK, got it, but numpy arrays are not quite the same as lists, there is the additional complication that two different array objects can share the same data:

    >>> import numpy as N
    >>> a = N.ones((5,))
    >>> b = a[:]
    >>> a is b
    False
    >>> a[2] = 5
    >>> a
    array([ 1.,  1.,  5.,  1.,  1.])
    >>> b
    array([ 1.,  1.,  5.,  1.,  1.])

This is very useful, but can be tricky. In a way, it's like a nested list:

    >>> a = [[1,2,3,4]]
    >>> b = [a[0]]
    >>> a is b
    False
    >>> a[0][2] = 5
    >>> a
    [[1, 2, 5, 4]]
    >>> b
    [[1, 2, 5, 4]]

hey! changing a changed b too!

So the key is that in your case, it probably doesn't matter if a and b are the same object, as long as they share the same data, and having multiple arrays sharing the same data is a common idiom in numpy.

> That is, if we do something like 'a = [i + 1 for i in a]' then 'id(a) != id(b)'. However, we can do 'a[:] = [i + 1 for i in a]' to modify a in-place.

Ah, but as Travis pointed out, the difference is not in assignment or anything like that, but in the fact that a list comprehension produces a copy, which is analogous to:

    flipud(a).copy()

In numpy, you DO need to be aware of when you are getting copies, and when you are getting views, and what the consequences are.

So really, the only "bug" here is in the docs -- they should make it clear whether a function returns a copy or a view.

-Chris

Travis Oliphant

7:30 p.m.

Zachary Pincus wrote:

> Hello folks,
>
> I recently was trying to write code to modify an array in-place (so as not to invalidate any references to that array) via the standard python idiom for lists, e.g.:
>
>     a[:] = numpy.flipud(a)
>
> Now, flipud returns a view on 'a', so assigning that to 'a[:]' provides pretty strange results as the buffer that is being read (the view) is simultaneously modified. Here is an example:

This is a known feature of the "view" concept. It has been present in Numeric from the beginning. Performing operations in-place using a view always gives "hard-to-predict" results. It depends completely on how the algorithms are implemented.

Knowing that numpy.flipud(a) is just a different way to write a[::-1,...], which works for any nested sequence, helps you realize that if a is already an array, then it returns a reversed view, which when copied back into itself creates the results you obtained but might not have been expecting.

You can understand the essence of what is happening with a simpler example:

    a = arange(10)
    a[:] = a[::-1]

What is a?

It is easy to see the answer when you realize that the code is doing the equivalent of

    a[0] = a[9]
    a[1] = a[8]
    a[2] = a[7]
    a[3] = a[6]
    a[4] = a[5]
    a[5] = a[4]
    a[6] = a[3]
    a[7] = a[2]
    a[8] = a[1]
    a[9] = a[0]

Notice that the final 5 lines are completely redundant, so really all that is happening is

    a[:5] = a[5:][::-1]

There was an explicit warning of the oddities of this construct in the original Numeric documentation.

Better documentation of the flipud function to indicate that it returns a view is definitely desirable. In fact, all functions that return views should be clear about this in the docstring.

In addition, all users of "in-place" functionality of NumPy must be aware of the view concept and realize that you could be modifying the array you are using. This came up before when somebody asked how to perform a "diff" in place and I was careful to make sure not to change the input array before it was used.
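The element-by-element reading above can be reproduced with an explicit Python loop, which makes the result independent of how any particular NumPy version performs the slice assignment. This is only a sketch of the naive forward copy, with illustrative names, assuming NumPy imported as `np`:

```python
import numpy as np

# 1-D case: copy element n-1-i into element i, front to back, in one buffer.
a = np.arange(10)
n = len(a)
for i in range(n):
    a[i] = a[n - 1 - i]
print(a)  # [9 8 7 6 5 5 6 7 8 9]

# 2-D case from the start of the thread: the same loop applied to rows.
m = np.arange(10).reshape((5, 2))
rows = len(m)
for i in range(rows):
    m[i] = m[rows - 1 - i]
print(m)
```

By the time the loop reaches the second half, the first half has already been overwritten, so the values copied back are the ones written a moment earlier.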

> A question, then: Does this represent a bug? Or perhaps there is a better idiom for modifying an array in-place than 'a[:] = ...'? Or is it incumbent on the user to ensure that any time an array is directly modified, that the modifying array is not a view of the original array?

Yes, it is and has always been incumbent on the user to ensure that any time an array is directly modified in-place that the modifying array is not a "view" of the original array.

Good example...

-Travis

Zachary Pincus

8:01 p.m.

>> A question, then: Does this represent a bug? Or perhaps there is a better idiom for modifying an array in-place than 'a[:] = ...'? Or is it incumbent on the user to ensure that any time an array is directly modified, that the modifying array is not a view of the original array?
>
> Yes, it is and has always been incumbent on the user to ensure that any time an array is directly modified in-place that the modifying array is not a "view" of the original array.

Fair enough. Now, how does a user ensure this -- say someone like me, who has been using numpy (et alia) for a couple of years, but clearly not long enough to have an 'intuitive' feel for every time something might be a view (a feeling that must seem quite natural to long-time numpy users, who may have forgotten precisely how long it takes to develop that level of intuition)?

Documentation of what returns views helps, for sure. Would any other 'training' mechanisms help? Say a function that (despite Tim's pretty reasonable 'don't do that' warning) will return true when two arrays have overlapping memory? Or an 'inplace_modify' function that takes the time to make that check?

Perhaps I'm the first to have views bite me in this precise way. However, if there are common failure-modes with views, I hope it's not too unreasonable to ask about ways that those common problems might be addressed. (Other than just saying "train for ten years, and you too will have numpy-fu, my son.") Giving newbies tools to deal with common problems with admittedly "dangerous" constructs might be useful.

Zach

Travis Oliphant

8:15 p.m.

Zachary Pincus wrote:

>>> A question, then: Does this represent a bug? Or perhaps there is a better idiom for modifying an array in-place than 'a[:] = ...'? Or is it incumbent on the user to ensure that any time an array is directly modified, that the modifying array is not a view of the original array?
>>
>> Yes, it is and has always been incumbent on the user to ensure that any time an array is directly modified in-place that the modifying array is not a "view" of the original array.
>
> Fair enough. Now, how does a user ensure this -- say someone like me, who has been using numpy (et alia) for a couple of years, but clearly not long enough to have an 'intuitive' feel for every time something might be a view (a feeling that must seem quite natural to long-time numpy users, who may have forgotten precisely how long it takes to develop that level of intuition)?

Basically, red-flags go off when you do in-place modification of any kind and you make sure you don't have an inappropriate view. That pretty much describes my "intuition." Views arise from "slicing" notation. The flipud returning a view is a bit obscure and should be documented better.

> Documentation of what returns views helps, for sure. Would any other 'training' mechanisms help? Say a function that (despite Tim's pretty reasonable 'don't do that' warning) will return true when two arrays have overlapping memory? Or an 'inplace_modify' function that takes the time to make that check?

I thought I had written a function that would see if two input arrays have overlapping memory, but maybe not. It's not hard for a contiguous chunk of memory, but for two views it's a harder function to write. It's probably a good idea to have such a thing, however.

-Travis
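One way the 'inplace_modify' idea could be sketched with what current NumPy offers: `inplace_assign` below is a hypothetical helper (not a NumPy function) that uses `np.may_share_memory` to decide whether the source needs copying before it is written into the destination.

```python
import numpy as np

def inplace_assign(dst, src):
    """Write `src` into `dst`, copying `src` first if the two might overlap.

    Hypothetical helper for illustration; not part of NumPy.
    """
    src = np.asarray(src)
    if np.may_share_memory(dst, src):
        src = src.copy()  # conservative: may copy when it was not needed
    dst[...] = src
    return dst

a = np.arange(10).reshape((5, 2))
alias = a
inplace_assign(a, np.flipud(a))
print(a)
print(alias is a)
```

The check is deliberately conservative, so the only cost of a false positive is an extra copy.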

Christopher Barker

8:39 p.m.

Zachary Pincus wrote:

> Say a function that (despite Tim's pretty reasonable 'don't do that' warning) will return true when two arrays have overlapping memory?

I think it would be useful, even if it's not robust. I'd still like to know if a given two arrays COULD share data.

I suppose to really be robust, what I'd really want to know is if a given array shares data with ANY other array, i.e. could changing this mess something up? -- but I'm pretty sure that is next to impossible

-Chris

Travis Oliphant

8:46 p.m.

Christopher Barker wrote:

> Zachary Pincus wrote:
>> Say a function that (despite Tim's pretty reasonable 'don't do that' warning) will return true when two arrays have overlapping memory?
>
> I think it would be useful, even if it's not robust. I'd still like to know if a given two arrays COULD share data.
>
> I suppose to really be robust, what I'd really want to know is if a given array shares data with ANY other array, i.e. could changing this mess something up? -- but I'm pretty sure that is next to impossible

Yeah, we don't keep track of who has a reference to a particular array. The only way to get that information would be to walk through all the objects defined and see if any of them share memory with me.

You can sometimes get away with it by looking at the reference count of the object. But the reference count is used in more ways than that, and so it's a very conservative check.

In the array interface I'm proposing for inclusion into Python, an object that shares memory could define a "call-back" function that (if defined) would be called when the view to the memory was released. That way objects could store information regarding how many "views" they have extant.

-Travis

Timothy Hochberg

8:54 p.m.

On 2/1/07, Christopher Barker wrote:

> Zachary Pincus wrote:
>> Say a function that (despite Tim's pretty reasonable 'don't do that' warning) will return true when two arrays have overlapping memory?
>
> I think it would be useful, even if it's not robust. I'd still like to know if a given two arrays COULD share data.
>
> I suppose to really be robust, what I'd really want to know is if a given array shares data with ANY other array, i.e. could changing this mess something up? -- but I'm pretty sure that is next to impossible

It's not totally impossible in theory -- languages like Haskell and Clean (which I'm playing with now) manage to use arrays that get updated without copying, while still maintaining the illusion that everything is constant and thus you can't mess up any other arrays. While it's fun to play with and Clean is allegedly pretty fast, it takes quite a bit of work to wrap one's head around.

In a language like Python I expect that it would be pretty hard to come up with something useful. Most of the checks would probably be too conservative and thus not useful.

-tim

--
//=][=\\

tim.hochberg@ieee.org
