
Hello! I have the following problem: I get a data array in column-major storage order and want to use it as a numpy array without copying the data. To do this, I do the following (2D example):

    obj = PyArray_FromDimsAndData(2, dim0, PyArray_DOUBLE, (char*)data);
    PyArrayObject *tmp = (PyArrayObject*)obj;
    tmp->flags = NPY_FARRAY;

In Python I can see that the F_CONTIGUOUS flag is true, but the numpy array is still displayed in the wrong (C-style) order, the same as if I don't set the NPY_FARRAY flag. Is there something more I have to consider, or are there other problems? (Sorry, I'm new to numpy. ;)

Thanks for any hint,
Regards, Georg
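(For readers following along: the flags only describe the layout; what numpy actually uses to address elements is the strides. A minimal sketch of the stride arithmetic for a 2D array of doubles; stride_sketch and its arguments are illustrative, not from the original mail.)

    #include <numpy/arrayobject.h>

    /* Element (i, j) lives at data + i*strides[0] + j*strides[1];
     * the strides, not the flags, decide how memory is interpreted. */
    void stride_sketch(npy_intp rows, npy_intp cols,
                       npy_intp c_strides[2], npy_intp f_strides[2])
    {
        /* C (row-major) order: the last index varies fastest. */
        c_strides[0] = cols * sizeof(double);
        c_strides[1] = sizeof(double);

        /* Fortran (column-major) order: the first index varies fastest. */
        f_strides[0] = sizeof(double);
        f_strides[1] = rows * sizeof(double);
    }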

Hello! I have now found a way to get at the data. As before (2D example):

    obj = PyArray_FromDimsAndData(2, dim0, PyArray_DOUBLE, (char*)data);
    PyArrayObject *tmp = (PyArrayObject*)obj;
    tmp->flags = NPY_FARRAY;

If in that example I also swap the strides:

    int s = tmp->strides[1];
    tmp->strides[0] = s;
    tmp->strides[1] = s * dim0[0];

then in Python I get the Fortran-style array in the right order. However, I don't know if this is the usual way to do it, or whether it has a performance overhead ...

Thanks,
Regards, Georg
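(For what it's worth, a possibly cleaner route is a sketch along these lines, assuming PyArray_New behaves as its documentation describes: with a non-NULL data pointer, NULL strides, and the NPY_FARRAY flag, numpy computes the Fortran-order strides itself, so the struct does not have to be patched afterwards. wrap_fortran is a hypothetical helper name.)

    #include <numpy/arrayobject.h>

    /* Sketch: wrap an existing column-major buffer as a Fortran-ordered
     * numpy array without copying.  The array does not own the data, so
     * the caller must keep the buffer alive as long as the array exists. */
    PyObject *wrap_fortran(double *data, npy_intp *dims)
    {
        return PyArray_New(&PyArray_Type, 2, dims, NPY_DOUBLE,
                           NULL,          /* strides: computed from flags */
                           (void *)data,  /* existing column-major buffer */
                           0,             /* itemsize: taken from the type */
                           NPY_FARRAY,    /* request Fortran-order strides */
                           NULL);
    }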

Georg Holzmann wrote:
> Hello!
> I have now found a way to get at the data. As before (2D example):
>
>     obj = PyArray_FromDimsAndData(2, dim0, PyArray_DOUBLE, (char*)data);
>     PyArrayObject *tmp = (PyArrayObject*)obj;
>     tmp->flags = NPY_FARRAY;
>
> If in that example I also swap the strides:
>
>     int s = tmp->strides[1];
>     tmp->strides[0] = s;
>     tmp->strides[1] = s * dim0[0];
>
> then in Python I get the Fortran-style array in the right order.
> However, I don't know if this is the usual way to do it, or whether it
> has a performance overhead ...

This depends on what you are trying to do, but generally I find that, if you can afford it memory-wise, it is much faster to just get a C-contiguous array when you access your array element by element. If you don't access it element by element, then it can become much more difficult, of course (especially if you have to use the same parts of the memory several times).

cheers,
David

Hello!

> This depends on what you are trying to do, but generally I find that,
> if you can afford it memory-wise, it is much faster to just get a
> C-contiguous array when you access your array element by element.

Yes, but the problem is that this data is very big (up to my memory limit), and in Python I only want to read it for debugging purposes. And if I want a copy as a C-contiguous array, I can always make one:

    my_cstyle_array = my_method().copy()

Thanks,
Regards, Georg

On 26/10/2007, Georg Holzmann <grh@mur.at> wrote:
> If in that example I also swap the strides:
>
>     int s = tmp->strides[1];
>     tmp->strides[0] = s;
>     tmp->strides[1] = s * dim0[0];
>
> then in Python I get the Fortran-style array in the right order.
This is the usual way. More or less, at least.

numpy is designed from the start to handle arrays with arbitrary striding; this is how slices are implemented, for example. There will be no major performance hit from numpy code itself. The actual organization of data in memory will of course affect the speed at which your code runs. The flags, as you discovered, are just a performance optimization, so that code that needs arrays organized in C- or Fortran-standard order doesn't need to check the strides every time.

I don't think numpy's loops - for example in ones((100,100))+eye(100) - are smart about doing operations in an order that makes cache-coherent use of memory. The important exception is the loops that use ATLAS, which I think is mostly the dot() function.

Anne
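(To illustrate the kind of check the flags make cheap, here is a minimal sketch using the standard numpy C API macros; sum2d is a hypothetical helper and assumes a 2D array of doubles.)

    #include <numpy/arrayobject.h>

    /* Sketch: a fast path treats the buffer as one flat block when the
     * array is C-contiguous; otherwise fall back to stride arithmetic. */
    double sum2d(PyArrayObject *arr)
    {
        npy_intp i, j;
        double total = 0.0;

        if (PyArray_ISCONTIGUOUS(arr)) {        /* C-contiguous fast path */
            double *p = (double *)PyArray_DATA(arr);
            npy_intp n = PyArray_SIZE(arr);
            for (i = 0; i < n; i++)
                total += p[i];
        } else {                                /* general strided path */
            char *base = PyArray_BYTES(arr);
            npy_intp *s = PyArray_STRIDES(arr);
            for (i = 0; i < PyArray_DIM(arr, 0); i++)
                for (j = 0; j < PyArray_DIM(arr, 1); j++)
                    total += *(double *)(base + i * s[0] + j * s[1]);
        }
        return total;
    }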

Anne Archibald wrote:
> On 26/10/2007, Georg Holzmann <grh@mur.at> wrote:
>> If in that example I also swap the strides:
>>
>>     int s = tmp->strides[1];
>>     tmp->strides[0] = s;
>>     tmp->strides[1] = s * dim0[0];
>>
>> then in Python I get the Fortran-style array in the right order.
>
> This is the usual way. More or less, at least.
>
> numpy is designed from the start to handle arrays with arbitrary
> striding; this is how slices are implemented, for example. There will
> be no major performance hit from numpy code itself. The actual
> organization of data in memory will of course affect the speed at
> which your code runs. The flags, as you discovered, are just a
> performance optimization, so that code that needs arrays organized in
> C- or Fortran-standard order doesn't need to check the strides every
> time.
>
> I don't think numpy's loops - for example in ones((100,100))+eye(100)
> - are smart about doing operations in an order that makes
> cache-coherent use of memory. The important exception is the loops
> that use ATLAS, which I think is mostly the dot() function.
There is an optimization wherein the inner loops are done over the dimension with the smallest stride.

What other cache-coherent optimizations do you recommend?

-Travis
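(A minimal sketch of that smallest-stride idea for a 2D loop; scale2d is a hypothetical helper, and negative strides are ignored for brevity.)

    #include <numpy/arrayobject.h>

    /* Sketch: run the inner loop over whichever axis has the smaller
     * stride, so consecutive inner iterations touch adjacent memory. */
    void scale2d(char *data, npy_intp *dims, npy_intp *strides, double factor)
    {
        int outer = 0, inner = 1;
        npy_intp i, j;

        /* Put the smallest-stride axis innermost. */
        if (strides[0] < strides[1]) {
            outer = 1;
            inner = 0;
        }
        for (i = 0; i < dims[outer]; i++) {
            char *row = data + i * strides[outer];
            for (j = 0; j < dims[inner]; j++)
                *(double *)(row + j * strides[inner]) *= factor;
        }
    }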

On 26/10/2007, Travis E. Oliphant <oliphant@enthought.com> wrote:
> There is an optimization wherein the inner loops are done over the
> dimension with the smallest stride.
>
> What other cache-coherent optimizations do you recommend?
That sounds like a very good first step. I'm far from an expert on this sort of thing, but here are a few ideas at random:

* internally flattening arrays when this doesn't affect the result (e.g. ones((10,10))+ones((10,10)))
* prefetching memory: in a C application I recently wrote, explicitly prefetching data for interpolation cut my runtime by 30%. This includes telling the processor when you're done with data so it can be purged from the cache. (A minimal sketch follows after this list.)
* aligning (some) arrays to 8-, 16-, 32-, or 64-byte boundaries so that they divide nicely into cache lines
* using MMX/SSE instructions when available
* combining ufuncs so that computations can keep the CPU busy while it waits for data to come in from main RAM (I realize that this is properly the domain of numexpr)
* using ATLAS- or FFTW-style autotuning to determine the fastest ways to structure computations (again more relevant for significant expressions than for simple ufuncs)
* reducing the use of temporaries in the interest of reducing traffic to main memory
* OpenMP parallel operations when this actually speeds up the calculation

I realize most of these are a lot of work, and some of them are probably in numpy already. Moreover, without using an expression parser it's probably not feasible to implement others. But an array language offers the possibility that a runtime can implement all sorts of optimizations without effort on the user's part.

Anne
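(The promised prefetching sketch. It is GCC/Clang-specific: __builtin_prefetch is a compiler builtin, not portable C, and the lookahead distance of 16 elements is an arbitrary illustration, not a tuned value.)

    #include <stddef.h>

    /* Sketch: prefetch operands a fixed distance ahead of the current
     * iteration so the loads overlap with the arithmetic. */
    void add_arrays(const double *a, const double *b, double *out, size_t n)
    {
        size_t i;
        for (i = 0; i < n; i++) {
            if (i + 16 < n) {
                __builtin_prefetch(&a[i + 16], 0, 0);  /* read, low temporal locality */
                __builtin_prefetch(&b[i + 16], 0, 0);
            }
            out[i] = a[i] + b[i];
        }
    }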