Unexpected behavior with numpy array
Good day,

Reversing a 1-dimensional array in numpy is simple: A = A[::-1]. However, A is then a new array referring to the old one and is no longer contiguous. While trying to reverse an array in place and keep it contiguous, I encountered some weird behavior. The reason for keeping it contiguous is that the array must be passed to an old C function I have, which expects the buffer to be in row-major order and contiguous. I am using lots of memory, so I want to minimize copying and allocation of new arrays.
>>> A = numpy.arange(0, 10)
>>> A
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
>>> A[::-1]
array([9, 8, 7, 6, 5, 4, 3, 2, 1, 0])
>>> A[:] = A[::-1]
>>> A
array([9, 8, 7, 6, 5, 5, 6, 7, 8, 9])
Is there any way to perform assignments of the form

X[sliceI] = expression involving X with dimensions that are compatible with the left-hand side

without causing the odd behavior I mentioned? If not, it might be helpful to throw an exception when both the LHS and the RHS of an assignment reference a slice of the same array.

On a similar note, does the assignment A = A * B create a new array with a new buffer to hold the result of A * B, and assign A to refer to the new array?

Thanks very much!

Damian
On Sun, Feb 03, 2008 at 12:25:56PM -0700, Damian Eads wrote:
On similar note, does the assignment
A = A * B
create a new array with a new buffer to hold the result of A * B, and assign A to refer to the new array?
Yes. Without a JIT, Python cannot know that A is present both on the RHS and on the LHS of the assignment. If you want to modify A in place, you can do A *= B.

HTH,

Gaël
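Gaël's point can be checked directly: A = A * B rebinds the name A to a freshly allocated result array, while A *= B writes into A's existing buffer. A minimal sketch (the variable names are illustrative):

```python
import numpy as np

A = np.arange(5, dtype=float)        # [0. 1. 2. 3. 4.]
B = np.full(5, 2.0)

addr_before = A.ctypes.data          # address of A's data buffer

A = A * B                            # allocates a NEW array; the name A is rebound
addr_after_rebind = A.ctypes.data    # a different buffer

A *= B                               # multiplies in place, reusing the buffer
addr_after_inplace = A.ctypes.data   # same buffer as after the rebind
```

Comparing the addresses shows that only the first form allocates: addr_after_rebind differs from addr_before, while the in-place multiply leaves the buffer address unchanged.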
On 03/02/2008, Damian Eads <eads@soe.ucsc.edu> wrote:
Good day,
Reversing a 1-dimensional array in numpy is simple,
A = A[::-1].
However A is a new array referring to the old one and is no longer contiguous.
While trying to reverse an array in place and keep it contiguous, I encountered some weird behavior. The reason for keeping it contiguous is the array must be passed to an old C function I have, which expects the buffer to be in row major order and contiguous. I am using lots of memory so I want to minimize copying and allocation of new arrays.
The short answer is that reversing an array in-place requires a certain amount of care, in C - you have to explicitly walk through swapping element i and element n-1-i, using a temporary variable. Getting numpy to exchange the elements in-place is going to be a real pain. I suggest trying a copying method first, and only getting fancier if it's too slow.
>>> A = numpy.arange(0, 10)
>>> A
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
>>> A[::-1]
array([9, 8, 7, 6, 5, 4, 3, 2, 1, 0])
>>> A[:] = A[::-1]
>>> A
array([9, 8, 7, 6, 5, 5, 6, 7, 8, 9])
Is there any way to perform assignments of the following form
X[sliceI] = expression involving X with dimensions that are compatible with the left hand side
without causing the odd behavior I mentioned. If not, it might be helpful to throw an exception when both the LHS and the RHS of an assignment reference an array slice of the same variable?
This is, odd as it seems, intended behaviour. Sometimes it's really useful to use the same array on the LHS and RHS: for example

A[:-1] = A[1:]

You just need to know how the operation is done under the hood (the arrays are iterated over in index order). (Actually, I'm not totally sure this is specified - under some circumstances numpy may iterate over dimensions in an order based on stride and/or size; whether this can affect the result of an operation like the above I'll have to think about.) In any case, much code depends on this.

Tricks for getting the array backwards without copying... hmm. Well, you might be able to fill in the array in an unconventional order:

A = N.arange(n-1, -1, -1)
A = A[::-1]
N.sin(A, A) # or whatever

Now the reversal of A is a perfectly normal C array (though numpy may not realize this; don't trust the flags, check the strides and sizes).

Or you could just write a little C function inplace_reverse(A); if you're already linking to C this shouldn't add too much complexity to your project.

Or you could do it explicitly in numpy (note the index n-1-i, which keeps the swap in bounds):

for i in xrange(n/2):
    t = A[i]
    A[i] = A[n-1-i]
    A[n-1-i] = t

In the likely case that this is too slow, you can do the copying in blocks, small enough that memory consumption is moderate but large enough that the python overhead is not too much.

Finally, I realize that digging around in legacy code can be miserable, but it is often not really very difficult to make a C function handle strided data - the whole principle of numpy is that compiled code really just needs to know the start address, data type, and the spacing and length of an array along each dimension.

Anne
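To make the hazard concrete: a shift like A[:-1] = A[1:] reads each element before overwriting it, while the reversal reported in the thread overwrites the tail it still needs to read; copying the right-hand side first is always safe. A small sketch of both cases:

```python
import numpy as np

A = np.arange(10)
A[:-1] = A[1:]                 # benign overlap: each source element is read first
# A is now [1, 2, 3, 4, 5, 6, 7, 8, 9, 9]

A = np.arange(10)
A[:] = A[::-1].copy()          # an explicit copy makes the reversal correct
```

The .copy() costs a temporary array, which is exactly the memory overhead the original poster is trying to avoid; the block-based approach discussed later trades a little Python overhead for a bounded temporary.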
Thanks Anne for your very informative response. Anne Archibald wrote:
On 03/02/2008, Damian Eads <eads@soe.ucsc.edu> wrote:
Good day,
Reversing a 1-dimensional array in numpy is simple,
A = A[::-1].
However A is a new array referring to the old one and is no longer contiguous.
While trying to reverse an array in place and keep it contiguous, I encountered some weird behavior. The reason for keeping it contiguous is the array must be passed to an old C function I have, which expects the buffer to be in row major order and contiguous. I am using lots of memory so I want to minimize copying and allocation of new arrays.
The short answer is that reversing an array in-place requires a certain amount of care, in C - you have to explicitly walk through swapping element i and element n-1-i, using a temporary variable. Getting numpy to exchange the elements in-place is going to be a real pain. I suggest trying a copying method first, and only getting fancier if it's too slow.
I was using the copying method on my smaller data set. When I run the same code on the larger data set, I get "Out of Memory" errors.
>>> A = numpy.arange(0, 10)
>>> A
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
>>> A[::-1]
array([9, 8, 7, 6, 5, 4, 3, 2, 1, 0])
>>> A[:] = A[::-1]
>>> A
array([9, 8, 7, 6, 5, 5, 6, 7, 8, 9])
Is there any way to perform assignments of the following form
X[sliceI] = expression involving X with dimensions that are compatible with the left hand side
without causing the odd behavior I mentioned. If not, it might be helpful to throw an exception when both the LHS and the RHS of an assignment reference an array slice of the same variable?
This is, odd as it seems, intended behaviour. Sometimes it's really useful to use the same array on the LHS and RHS: for example
A[:-1]=A[1:]
Agreed, this way of in-place shifting is useful, and we certainly would not want to forbid it.
You just need to know how the operation is done under the hood (the arrays are iterated over in index order). (Actually, I'm not totally sure this is specified - under some circumstances numpy may iterate over dimensions in an order based on stride and/or size; whether this can affect the result of an operation like the above I'll have to think about.) In any case, much code depends on this.
Tricks for getting the array backwards without copying... hmm. Well, you might be able to fill in the array in an unconventional order:
A = N.arange(n-1, -1, -1)
A = A[::-1]
N.sin(A, A) # or whatever
Indeed, building an index array in descending order is one way to fill an array in reverse order. The array I am dealing with is not generated from arange but is generated from another numerical routine, which is significantly more complicated. arange was used in the example to give a simple reproduction of the error.
Now the reversal of A is a perfectly normal C array (though numpy may not realize this; don't trust the flags, check the strides and sizes).
Agreed -- reversals are easy to implement in C. To disclose more: I need much more than reversals, so I tried to see how well numpy supports in-place algorithms in general. If they were well supported, I figured I could save myself from writing a whole bunch of other C code and write more succinct numpy code instead. The reversal code was just a small experiment. I should have given the idea more thought: many in-place algorithms can be very non-trivial to vectorize, so it is unreasonable to expect numpy array slicing to generally support the safe implementation of vectorized, in-place algorithms. The fact that weird behavior sometimes occurs is a bit concerning, but I guess there are always dangers that come with the flexibility offered by Python.
Or you could do it explicitly in numpy (note the index n-1-i, which keeps the swap in bounds):

for i in xrange(n/2):
    t = A[i]
    A[i] = A[n-1-i]
    A[n-1-i] = t
I'm trying to avoid using Python 'for' loops; there is too much data.
In the likely case that this is too slow, you can do the copying in blocks, small enough that memory consumption is moderate but large enough that the python overhead is not too much.
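The block-based approach Anne suggests might look like the following sketch (reverse_inplace and the block size are illustrative, not an existing numpy routine): it swaps one bounded-size block from each end per iteration, so the temporary copy never exceeds the block size.

```python
import numpy as np

def reverse_inplace(A, block=65536):
    """Reverse a 1-D array in place using at most `block` elements of scratch."""
    n = A.shape[0]
    half = n // 2
    for start in range(0, half, block):
        stop = min(start + block, half)
        tmp = A[start:stop].copy()                   # small temporary copy
        A[start:stop] = A[n - stop:n - start][::-1]  # mirror block from the far end
        A[n - stop:n - start] = tmp[::-1]
    # for odd n the middle element stays where it is

A = np.arange(10)
reverse_inplace(A, block=3)   # small block just to exercise the loop
```

The choice of block trades memory against Python-level loop overhead: a few tens of thousands of elements keeps the per-iteration overhead negligible while the scratch buffer stays small relative to the array.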
One of the reasons why I chose numpy/Python was that it offers a succinct syntax. A block-based approach would work, but with less succinctness and readability than the C function.
Finally, I realize that digging around in legacy code can be miserable, but it is often not really very difficult to make a C function handle strided data - the whole principle of numpy is that compiled code really just needs to know the start address, data type, and the spacing and length of an array along each dimension.
This concept of striding an array buffer passed from a higher-level language is not new to numpy/Python, though. There are potential additional costs when more complicated non-contiguous striding is used, such as page faults and the extra arithmetic for computing complex indices. Adding one to an index or incrementing a buffer pointer is generally more readable and probably requires less computation. I agree that in general it's not difficult to use striding, but the larger the legacy code base, the more changes might be needed, and the more likely bugs or bottlenecks will be introduced. As the number of bugs increases, they become more difficult to pin down. Thus, I'm generally a bit cautious about making such changes to legacy code.

Here's another question: is there any way to construct a numpy array and specify the buffer address where it should store its values? I ask because I would like to construct numpy arrays that work on buffers that come from mmap.

Thanks again for your comments and help!

Damian Eads
---
University of California, Santa Cruz
http://www.soe.ucsc.edu/~eads
Damian Eads wrote:
Here's another question: is there any way to construct a numpy array and specify the buffer address where it should store its values? I ask because I would like to construct numpy arrays that work on buffers that come from mmap.
Can you clarify that a little? By "buffer" do you mean a Python buffer() object? By "mmap" do you mean Python's mmap in the standard library?

numpy has a memmap class which subclasses ndarray to wrap a mmapped file. It handles the opening and mmapping of the file itself, but it could be subclassed to override this behavior to take an already opened mmap object.

In general, if you have a buffer() object, you can make an array from it using numpy.frombuffer(). This will be a standard ndarray and won't have the conveniences of syncing to disk that the memmap class provides.

If you don't have a buffer() object, but just have a pointer allocated from some C code, then you *could* fake an object which exposes the __array_interface__() method to describe the memory. The numpy.asarray() constructor will use that to make an ndarray object that uses the specified memory. This is advanced stuff and difficult to get right because of memory ownership and object lifetime issues. If you can modify the C code, it might be easier for you to have numpy allocate the memory, then make the C code use that pointer to do its operations.

But look at numpy.memmap first and see if it fits your needs.

-- Robert Kern

"I have come to believe that the whole world is an enigma, a harmless enigma that is made terrible by our own mad attempt to interpret it as though it had an underlying truth." -- Umberto Eco
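The no-copy behaviour of numpy.frombuffer() that Robert describes can be sketched as follows (modern spelling: a writable bytearray stands in here for the Python 2 buffer() object the thread discusses):

```python
import numpy as np

raw = bytearray(8 * 4)                   # 32 bytes of writable backing memory
a = np.frombuffer(raw, dtype=np.int32)   # wraps raw's memory; no copy is made
a[:] = np.arange(8)                      # writes land in `raw` itself

# The array does not own the memory, so numpy will never try to free it:
# a.flags['OWNDATA'] is False
```

Because the array merely borrows the buffer, the buffer's real owner must outlive every array (and view) built on top of it.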
Robert Kern wrote:
Damian Eads wrote:
Here's another question: is there any way to construct a numpy array and specify the buffer address where it should store its values? I ask because I would like to construct numpy arrays that work on buffers that come from mmap.
Can you clarify that a little? By "buffer" do you mean a Python buffer() object?
Yes, I mean the .data field of a numpy array, which is a buffer object, and points to the memory where an array's values are stored.
By "mmap" do you mean Python's mmap in the standard library?
I actually was referring to the C Standard Library's mmap. My intention was to use a pointer returned by C-mmap as the ".data" buffer to store array values.
numpy has a memmap class which subclasses ndarray to wrap a mmapped file. It handles the opening and mmapping of the file itself, but it could be subclassed to override this behavior to take an already opened mmap object.
This may satisfy my needs. I'm going to look into it and get back to you.
In general, if you have a buffer() object, you can make an array from it using numpy.frombuffer(). This will be a standard ndarray and won't have the conveniences of syncing to disk that the memmap class provides.
This is good to know because there have been a few situations when this would have been very useful. Suppose I do something like (in Python):

import ctypes
mylib = ctypes.CDLL('libmylib.so')
y = mylib.get_float_array_from_c_function()

which returns a float* as a Python int, and then I do

nelems = mylib.get_float_array_num_elems()
x = numpy.frombuffer(ctypes.c_buffer(y), 'float', nelems)

This gives me an ndarray x with its (.data) buffer pointing to the memory address given by y. When the ndarray x is no longer referenced (even as another array's base), does numpy attempt to free the memory pointed to by y? In other words, does numpy always deallocate the (.data) buffer in the __del__ method? Or, does frombuffer set a flag telling it not to?
If you don't have a buffer() object, but just have a pointer allocated from some C code, then you *could* fake an object which exposes the __array_interface__() method to describe the memory. The numpy.asarray() constructor will use that to make an ndarray object that uses the specified memory. This is advanced stuff and difficult to get right because of memory ownership and object lifetime issues.
Allocating memory in C code would be very useful for me. If I were to use such a numpy.asarray() function (seems the frombuffer you mentioned would also work as described above), it makes sense for the C code to be responsible for deallocating the memory, not numpy. I understand that I would need to ensure that the deallocation happens only when the containing ndarray is no longer referenced anywhere in Python (hopefully, ndarray's finalization code does not need access to the .data buffer).
If you can modify the C code, it might be easier for you to have numpy allocate the memory, then make the C code use that pointer to do its operations.
But look at numpy.memmap first and see if it fits your needs.
Will do! Thanks for the pointers! Damian
Damian Eads wrote:
Robert Kern wrote:
Damian Eads wrote:
Here's another question: is there any way to construct a numpy array and specify the buffer address where it should store its values? I ask because I would like to construct numpy arrays that work on buffers that come from mmap.

Can you clarify that a little? By "buffer" do you mean a Python buffer() object?
Yes, I mean the .data field of a numpy array, which is a buffer object, and points to the memory where an array's values are stored.
Actually, the .data field is always constructed by ndarray; it is never provided *to* ndarray even if you construct the ndarray from a buffer object. The buffer object's information is interpreted to construct the ndarray object and then the original buffer object is ignored. The .data attribute will be constructed "on-the-fly" when it is requested.

In [9]: from numpy import *
In [10]: s = 'aaaa'
In [11]: b = buffer(s)
In [12]: a = frombuffer(b, dtype=int32)
In [13]: a.data is b
Out[13]: False
In [14]: d1 = a.data
In [15]: d2 = a.data
In [16]: d1 is d2
Out[16]: False
By "mmap" do you mean Python's mmap in the standard library?
I actually was referring to the C Standard Library's mmap. My intention was to use a pointer returned by C-mmap as the ".data" buffer to store array values.
numpy has a memmap class which subclasses ndarray to wrap a mmapped file. It handles the opening and mmapping of the file itself, but it could be subclassed to override this behavior to take an already opened mmap object.
This may satisfy my needs. I'm going to look into it and get back to you.
In general, if you have a buffer() object, you can make an array from it using numpy.frombuffer(). This will be a standard ndarray and won't have the conveniences of syncing to disk that the memmap class provides.
This is good to know because there have been a few situations when this would have been very useful.
Suppose I do something like (in Python):
import ctypes
mylib = ctypes.CDLL('libmylib.so')
y = mylib.get_float_array_from_c_function()
which returns a float* as a Python int, and then I do
nelems = mylib.get_float_array_num_elems()
x = numpy.frombuffer(ctypes.c_buffer(y), 'float', nelems)
This gives me an ndarray x with its (.data) buffer pointing to the memory address given by y. When the ndarray x is no longer referenced (even as another array's base), does numpy attempt to free the memory pointed to by y? In other words, does numpy always deallocate the (.data) buffer in the __del__ method? Or, does frombuffer set a flag telling it not to?
By default, frombuffer() creates an array that is flagged as not owning the data. That means it will not delete the data memory when the ndarray object is destroyed.

In [69]: import ctypes
In [70]: ca = (ctypes.c_int*8)()
In [71]: a = frombuffer(ca, int)
In [72]: a
Out[72]: array([0, 0, 0, 0, 0, 0, 0, 0])
In [73]: a.flags
Out[73]:
  C_CONTIGUOUS : True
  F_CONTIGUOUS : True
  OWNDATA : False
  WRITEABLE : True
  ALIGNED : True
  UPDATEIFCOPY : False
If you don't have a buffer() object, but just have a pointer allocated from some C code, then you *could* fake an object which exposes the __array_interface__() method to describe the memory. The numpy.asarray() constructor will use that to make an ndarray object that uses the specified memory. This is advanced stuff and difficult to get right because of memory ownership and object lifetime issues.
Allocating memory in C code would be very useful for me. If I were to use such a numpy.asarray() function (seems the frombuffer you mentioned would also work as described above),
Yes, if you can create the buffer object or something that obeys the buffer protocol. ctypes arrays work fine; ctypes pointers don't.
it makes sense for the C code to be responsible for deallocating the memory, not numpy. I understand that I would need to ensure that the deallocation happens only when the containing ndarray is no longer referenced anywhere in Python (hopefully, ndarray's finalization code does not need access to the .data buffer).
My experience has been that this is fairly difficult to do. If you have *complete* control of the ndarray object over its entire lifetime, then this is reasonable. If you don't, then you are going to run into (nondeterministic!) segfaulting bugs eventually. For example, if you are only using it as a temporary inside a function and never return it, this is fine. You will also need to be very careful about constructing views from the ndarray; these will need to be controlled, too. You will have a bug if you delete myarray but return reversed_array = myarray[::-1], for example.

I see that you are using ctypes. Be sure to take a look at the .ctypes attribute on ndarrays. This allows you to get a ctypes pointer object from an array. This might help you use numpy to allocate the memory and pass that in to your C functions.

In [47]: a.ctypes.data_as(ctypes.POINTER(ctypes.c_int))
Out[47]: <ctypes.LP_c_long object at 0x1c7c800>

http://www.scipy.org/Cookbook/Ctypes

-- Robert Kern

"I have come to believe that the whole world is an enigma, a harmless enigma that is made terrible by our own mad attempt to interpret it as though it had an underlying truth." -- Umberto Eco
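Robert's suggestion - let numpy own the memory and hand C a pointer - can be sketched like this (no real C library is called; the ctypes pointer stands in for what a C function would receive):

```python
import ctypes
import numpy as np

a = np.zeros(8, dtype=np.float32)    # numpy allocates and owns the buffer
ptr = a.ctypes.data_as(ctypes.POINTER(ctypes.c_float))

# `ptr` is what you would pass to a C function taking float*.
# Keep `a` alive for as long as any C code may use the pointer.
ptr[0] = 3.5                         # a C callee writing through the pointer
```

Since numpy owns the buffer (OWNDATA is True), deallocation happens automatically when the array is garbage collected, so none of the lifetime bookkeeping falls on the C side.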
Damian Eads wrote:
This is good to know because there have been a few situations when this would have been very useful.
Suppose I do something like (in Python):
import ctypes
mylib = ctypes.CDLL('libmylib.so')
y = mylib.get_float_array_from_c_function()
which returns a float* as a Python int, and then I do
nelems = mylib.get_float_array_num_elems()
x = numpy.frombuffer(ctypes.c_buffer(y), 'float', nelems)
This gives me an ndarray x with its (.data) buffer pointing to the memory address given by y. When the ndarray x is no longer referenced (even as another array's base), does numpy attempt to free the memory pointed to by y? In other words, does numpy always deallocate the (.data) buffer in the __del__ method? Or, does frombuffer set a flag telling it not to?
NumPy won't free the memory unless the OWNDATA flag is set. Look at the flags attribute. frombuffer creates arrays that don't own their own data, so you are safe. A reference to the buffer object is kept in the NumPy array so it won't be deleted.

The ctypes.c_buffer is a new one for me. But it looks like that would work. NumPy is pretty useful for wrapping raw pointers to memory and then playing with the data inside of Python however you would like. The extended data-types make it very easy to do simple things with large data sets. It's one of the less widely understood features of NumPy.

-Travis O.
Travis E. Oliphant wrote:
Damian Eads wrote:
This is good to know because there have been a few situations when this would have been very useful.
Suppose I do something like (in Python):
import ctypes
mylib = ctypes.CDLL('libmylib.so')
y = mylib.get_float_array_from_c_function()
which returns a float* as a Python int, and then I do
nelems = mylib.get_float_array_num_elems()
x = numpy.frombuffer(ctypes.c_buffer(y), 'float', nelems)
This gives me an ndarray x with its (.data) buffer pointing to the memory address given by y. When the ndarray x is no longer referenced (even as another array's base), does numpy attempt to free the memory pointed to by y? In other words, does numpy always deallocate the (.data) buffer in the __del__ method? Or, does frombuffer set a flag telling it not to?
NumPy won't free the memory unless the OWNDATA flag is set. Look at the flags attribute. frombuffer creates arrays that don't own their own data, so you are safe. A reference to the buffer object is kept in the NumPy array so it won't be deleted.
The ctypes.c_buffer is a new one for me.
Unfortunately, it doesn't work the way it is used in Damian's example. It is a deprecated alias for create_string_buffer(), which creates a ctypes array of c_char from a string or a size. It does not make a buffer object from a ctypes pointer object or an address. -- Robert Kern "I have come to believe that the whole world is an enigma, a harmless enigma that is made terrible by our own mad attempt to interpret it as though it had an underlying truth." -- Umberto Eco
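Since ctypes.c_buffer won't do what Damian's example assumed, one approach that does work (a sketch; the integer address is simulated with a ctypes array rather than a real C mmap) is to rebuild a ctypes array type at the address with from_address(), which supports the buffer protocol, and hand that to numpy.frombuffer():

```python
import ctypes
import numpy as np

# Stand-in for a C function that returns a float* as an integer address:
backing = (ctypes.c_float * 4)(1.0, 2.0, 3.0, 4.0)
addr = ctypes.addressof(backing)
nelems = 4

# Recreate a ctypes array at that address, then wrap it without copying:
buf = (ctypes.c_float * nelems).from_address(addr)
x = np.frombuffer(buf, dtype=np.float32)

# x shares memory with `backing`; the real owner of the memory must stay
# alive (and must not be freed by C) for as long as x or any view exists.
```

This carries exactly the lifetime caveats Robert raised earlier in the thread: numpy never owns the memory here, so freeing it on the C side while the array is alive leads to nondeterministic crashes.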
participants (5)
- Anne Archibald
- Damian Eads
- Gael Varoquaux
- Robert Kern
- Travis E. Oliphant