Need help for implementing a fast clip in numpy (was slow clip)
Hi, I am finally implementing a C function to replace the current slow implementation of clip in python as promised a few weeks ago. The idea is to implement it as the following: def clip(input, min, max): a = input.copy() putmask(a, a <= min, min) putmask(a, a >= max, max) return a I don't have that much experience in writing general numpy functions, so I was wondering of other people could advise me on the following points. - should I care only about numeric types, eg something where PyArray_ISNUMBER returns true ? - If input is an array of type T1, min a scalar of type T2 and max a scalar of type T3, which type should be the reference, assuming they are all basic numeric types (float, int, etc...). Eg, if T1 is integer, T2 and T3 are float, what am I supposed to do ? Should I use the basic rule as told in numpy doc, and in that case, and do something like this to determine common datatype (pseudo code): int typenum; for each arg in arglist: typenum = PyArray_ObjectType(arg, typenum) - PyArray_PutMask does not assume anything on the datatype of the input array nor this alignement, and treat its binary content as a buffer to char*. Do I need to do the same, or can I just use "normal" pointer arithmetic on arrays which elements can be casted to basic C types ? cheers, David
El dt 09 de 01 del 2007 a les 23:19 +0900, en/na David Cournapeau va escriure:
Hi,
I am finally implementing a C function to replace the current slow implementation of clip in python as promised a few weeks ago. The idea is to implement it as the following:
def clip(input, min, max): a = input.copy() putmask(a, a <= min, min) putmask(a, a >= max, max) return a
I don't have that much experience in writing general numpy functions, so I was wondering of other people could advise me on the following points.
Sorry, but not real experience writing extensions directly in C. However, you may want to experiment using numexpr for doing what you want. It's relatively easy to extend numexpr and adding a new opcode to its virtual machine. I'm attaching a patch for implementing such a clip routine (only for floating point types, but given the example, it should be straightforward to add support for integers as well). Also, you should note that using the fancy indexing of numpy seems to perform better than the putmask approach. Below are my figures for a small benchmark (also attached) for testing the performance of clip using several approaches: time (putmask)--> 1.38 time (where)--> 2.713 time (numexpr where)--> 1.291 time (fancy+assign)--> 0.967 time (numexpr clip)--> 0.596 It is interesting to see there how fancy-indexing + assignation is quite more efficient than putmask. Unfortunately, integrating this clip version officially in numexpr doesn't seem realistic as the main authors are trying hard to avoid cluttering the VM with too many opcodes (they don't want to overload the instruction cache of the CPU). Nevertheless, as the implementation of clip in numexpr should be rather optimal, its performance can still be used as a reference for other implementations. HTH and good luck with your own clip implementation, -- Francesc Altet | Be careful about using the following code -- Carabos Coop. V. | I've only proven that it works, www.carabos.com | I haven't tested it. -- Donald Knuth
On Wed, Jan 10, 2007 at 08:28:14PM +0100, Francesc Altet wrote:
El dt 09 de 01 del 2007 a les 23:19 +0900, en/na David Cournapeau va escriure:
Hi,
I am finally implementing a C function to replace the current slow implementation of clip in python as promised a few weeks ago. The idea is to implement it as the following:
def clip(input, min, max): a = input.copy() putmask(a, a <= min, min) putmask(a, a >= max, max) return a
I don't have that much experience in writing general numpy functions, so I was wondering of other people could advise me on the following points.
Sorry, but not real experience writing extensions directly in C. However, you may want to experiment using numexpr for doing what you want. It's relatively easy to extend numexpr and adding a new opcode to its virtual machine. I'm attaching a patch for implementing such a clip routine (only for floating point types, but given the example, it should be straightforward to add support for integers as well).
Also, you should note that using the fancy indexing of numpy seems to perform better than the putmask approach. Below are my figures for a small benchmark (also attached) for testing the performance of clip using several approaches:
time (putmask)--> 1.38 time (where)--> 2.713 time (numexpr where)--> 1.291 time (fancy+assign)--> 0.967 time (numexpr clip)--> 0.596
It is interesting to see there how fancy-indexing + assignation is quite more efficient than putmask.
Not on my machine: time (putmask)--> 0.181 time (where)--> 0.783 time (numexpr where)--> 0.26 time (fancy+assign)--> 0.202 Cheers Stéfan
On 10/01/07, Stefan van der Walt <stefan@sun.ac.za> wrote:
On Wed, Jan 10, 2007 at 08:28:14PM +0100, Francesc Altet wrote:
El dt 09 de 01 del 2007 a les 23:19 +0900, en/na David Cournapeau va escriure:
Hi,
I am finally implementing a C function to replace the current slow implementation of clip in python as promised a few weeks ago. The idea is to implement it as the following:
def clip(input, min, max): a = input.copy() putmask(a, a <= min, min) putmask(a, a >= max, max) return a
It is interesting to see there how fancy-indexing + assignation is quite more efficient than putmask.
Not on my machine:
It is almost certain that the relative efficiency of the different numpy methods depends on what fraction of the data actually needs clipping, so different test data (or real data) is likely to favour one over the other. Why not write the algorithm in C? None of the above is relevant for C, where the algorithm should be "traverse the array; for each element, clip it". If speed is a big concern, the inner loop could be customized for each data type, and max() and min() should probably be used. But in any case, it will almost certainly run faster than any of the above solutions. And distributing the C should pose no problems - just toss it in with the zillion other C functions in numpy. Perverse as it sounds, f2py might actually set things up rather tidily, leaving a FORTRAN function of a handful of lines; even the data type switching could be done in python. A. M. Archibald
A. M. Archibald wrote:
Why not write the algorithm in C?
I did just that a while back, for Numeric. I've enclosed the code for reference. Unfortunately, I never did figure out an efficient way to write this sort of thing for all types, so it only does doubles. Also, it does a bunch of special casing for discontiguous vs. contiguous arrays, and clipping to an array vs a scaler for the min and max arguments. Do what you will with the code... look for NumericExtras_fastclip in with the rest of the stuff in there -- much of which is now obsolete -- I haven't touched this is a few years. -Chris NOTE: is there a clean and efficient way to write custom functions of this sort for all types and both contiguous and discontiguous arrays? I guess I'm imaging some smart iterators and a micro version of C++ templates -- or maybe use C++ templates themselves. -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker@noaa.gov #include "Python.h" #include "Numeric/arrayobject.h" // These are little macros I use to access array values for specific rank arrays #define ARRAYVAL0(aType,a) ( *(aType *)(a->data)) #define ARRAYVAL1(aType,a,i) ( *(aType *)(a->data + (i)*a->strides[0])) #define ARRAYVAL2(aType,a,i,j) ( *(aType *)(a->data + (i)*a->strides[0] + (j)*a->strides[1])) #define ARRAYVAL3(aType,a,i,j,k) ( *(aType *)(a->data + (i)*a->strides[0] + (j)*a->strides[1] + (k)*a->strides[2])) // These are some macros borrowed from arrayobject.c //#define SIZE(mp) (_PyArray_multiply_list((mp)->dimensions, (mp)->nd)) int AreSameShape(PyArrayObject *Array1, PyArrayObject *Array2){ int i; if (Array1->nd != Array2->nd){ return 0; } for (i = 0; i < Array1->nd; i++){ if (Array1->dimensions[i] != Array2->dimensions[i]){ return 0; } } return 1; } int ComputeOffset(PyArrayObject *Array, int *indexes){ int i, offset = 0; for (i = 0; i < Array->nd; i++) { offset += indexes[i]*Array->strides[i]; } return offset; } void ReplaceValue(PyArrayObject *Array, PyArrayObject *Min, PyArrayObject *Max, int *indexes){ // This function does the actual replacement of the values. // it allows Min and Max to be one element, or the same shape as Array // the offsets are computed separately, so as long as they are the same shape // it should work even if some are contiguous, and some not, for example double min, max; int offset; offset = ComputeOffset(Array,indexes); if (PyArray_Size((PyObject *)Min) == 1){ min = ARRAYVAL0(double,Min); } else { min = *(double *)(Min->data + ComputeOffset(Min,indexes)); } if (PyArray_Size((PyObject *)Max) == 1){ max = ARRAYVAL0(double,Max); } else { max = *(double *)(Max->data + ComputeOffset(Max,indexes)); } if (*(double *)(Array->data + offset) > max){ *(double *)(Array->data + offset) = max ; } else if (*(double *)(Array->data + offset) < min){ *(double *)(Array->data + offset) = min; } return; } static char doc_fastclip[] = "fastclip(a,min,max): a is clipped so that there are no values \n" "less than min, or greater than max.\n\n" "It only works on PyArrays of type Float.\n" "min and max can be either scalar or the same size as a.\n" "If a, min, and max are all contiguous (or scalar) then it is a LOT faster\n\n" "Note that if min > max, the results are undetermined"; static PyObject * NumericExtras_fastclip(PyObject *self, PyObject *args) { int NumArray, elementsize, i; int *indexes; PyObject *InMin; PyObject *InMax; PyArrayObject *Array; PyArrayObject *Min; PyArrayObject *Max; if (!PyArg_ParseTuple(args, "O!OO", &PyArray_Type, &Array, &InMin, &InMax)) { return NULL; } // check types of input if (Array->descr->type_num != PyArray_DOUBLE){ PyErr_SetString (PyExc_ValueError, "a must be a NumPy array of type Float"); return NULL; } // convert min and max to double arrays: // if they don't convert, the input values are not valid // also check if they are the right size Min = (PyArrayObject *) PyArray_FromObject(InMin, PyArray_DOUBLE, 0, 0); if (Min == NULL){ PyErr_SetString (PyExc_ValueError, "min must be an object that can be converted to an array of Floats"); return NULL; } if (!((PyArray_Size((PyObject *)Min) == 1) || AreSameShape(Min, Array))){ PyErr_SetString (PyExc_ValueError, "min must be either a scalar or the same size as a"); Py_DECREF(Min); return NULL; } Max = (PyArrayObject *) PyArray_FromObject(InMax, PyArray_DOUBLE, 0, 0); if (Max == NULL){ PyErr_SetString (PyExc_ValueError, "max must be an object that can be converted to an array of Floats"); Py_DECREF(Min); return NULL; } if (!((PyArray_Size((PyObject *)Max) == 1) || AreSameShape(Max, Array))){ PyErr_SetString (PyExc_ValueError, "max must be either a scalar or the same size as a"); Py_DECREF(Min); Py_DECREF(Max); return NULL; } // After all that input checking, switch between the contiguous and non-contiguous cases. if (PyArray_ISCONTIGUOUS(Array) && PyArray_ISCONTIGUOUS(Min) && PyArray_ISCONTIGUOUS(Max)){ //we have the contiguous case // This seems pretty kludgy to have the four cases, but it was the first // thing that came to mind that I was sure would work, and would accomidate // either array or scalar arguments for min and max. // it's also faster than checking if they are scalar inside the loop NumArray = PyArray_Size((PyObject *)Array); elementsize = sizeof(double); if (PyArray_Size((PyObject *)Min) == 1 && PyArray_Size((PyObject *)Max) == 1){ // both limits are scalar for (i = 0; i < NumArray; i++) { // loop over each element if (*(double *)(Array->data + i*elementsize) > ARRAYVAL0(double,Max)){ *(double *)(Array->data + i*elementsize) = ARRAYVAL0(double,Max); } else if (*(double *)(Array->data + i*elementsize) < ARRAYVAL0(double,Min)){ *(double *)(Array->data + i*elementsize) = ARRAYVAL0(double,Min); } } } else if (PyArray_Size((PyObject *)Min) == 1 && PyArray_Size((PyObject *)Max) == NumArray){ // Min is scalar for (i = 0; i < NumArray; i++) { // loop over each element if (*(double *)(Array->data + i*elementsize) > *(double *)(Max->data + i*elementsize)){ *(double *)(Array->data + i*elementsize) = *(double *)(Max->data + i*elementsize) ; } else if (*(double *)(Array->data + i*elementsize) < ARRAYVAL0(double,Min)){ *(double *)(Array->data + i*elementsize) = ARRAYVAL0(double,Min); } } } else if (PyArray_Size((PyObject *)Max) == 1 && PyArray_Size((PyObject *)Min) == NumArray){ // Max is scalar for (i = 0; i < NumArray; i++) { // loop over each element if (*(double *)(Array->data + i*elementsize) > ARRAYVAL0(double,Max)){ *(double *)(Array->data + i*elementsize) = ARRAYVAL0(double,Max); } else if (*(double *)(Array->data + i*elementsize) < *(double *)(Min->data + i*elementsize)){ *(double *)(Array->data + i*elementsize) = *(double *)(Min->data + i*elementsize) ; } } } else if (PyArray_Size((PyObject *)Min) == NumArray && PyArray_Size((PyObject *)Max) == NumArray){ // Neither is scalar for (i = 0; i < NumArray; i++) { // loop over each element if (*(double *)(Array->data + i*elementsize) > *(double *)(Max->data + i*elementsize)){ *(double *)(Array->data + i*elementsize) = *(double *)(Max->data + i*elementsize) ; } else if (*(double *)(Array->data + i*elementsize) < *(double *)(Min->data + i*elementsize)){ *(double *)(Array->data + i*elementsize) = *(double *)(Min->data + i*elementsize) ; } } } else { // The size of Min and/or Max don't match Array: we should never get here, do to previous checks. PyErr_SetString (PyExc_ValueError, "min and max must be either scalar or the same shape as a"); Py_DECREF(Min); Py_DECREF(Max); return NULL; } } else { // this now the non-contiguous case: it is much slower, but presumably could be optimised : suggestions gratefully excepted! indexes = calloc(Array->nd,sizeof(int)); while (1){ ReplaceValue(Array, Min, Max, indexes); i = (Array->nd) - 1; while ((i >= 0) && (indexes[i]+1 == Array->dimensions[i])){ indexes[i] = 0; i -= 1; } if (i<0) { break; } indexes[i] = indexes[i]+1; } free(indexes); } Py_DECREF(Min); Py_DECREF(Max); return Py_BuildValue(""); } static void byte_swap_data(void *p, int n, int size) { /* p is a pointer to the data block n is the size of the data block size is the size of the elements in the data block This code borrowed from arrayobject.c in Numeric It is used by byteswap */ char *a, *b, c; switch(size) { case 2: for (a = (char*)p ; n > 0; n--, a += 1) { b = a + 1; c = *a; *a++ = *b; *b = c; } break; case 4: for (a = (char*)p ; n > 0; n--, a += 2) { b = a + 3; c = *a; *a++ = *b; *b-- = c; c = *a; *a++ = *b; *b = c; } break; case 8: for (a = (char*)p ; n > 0; n--, a += 4) { b = a + 7; c = *a; *a++ = *b; *b-- = c; c = *a; *a++ = *b; *b-- = c; c = *a; *a++ = *b; *b-- = c; c = *a; *a++ = *b; *b = c; } break; default: break; } } static char doc_byteswap[] = "byteswap(m). Swaps the bytes in the contents of array m.\n" "Useful for reading data written by a machine with a different byte order.\n" "Operates on the array in place. Only works on contiguous arrays.\n"; static PyObject * NumericExtras_byteswap(PyObject *self, PyObject *args) { PyArrayObject *Array; if (!PyArg_ParseTuple(args, "O!", &PyArray_Type, &Array)) { return NULL; } if (!PyArray_ISCONTIGUOUS(Array)){ PyErr_SetString (PyExc_ValueError, "m must be contiguous"); return NULL; } if (Array->descr->type_num < PyArray_CFLOAT) { byte_swap_data(Array->data, PyArray_SIZE(Array), Array->descr->elsize); } else { byte_swap_data(Array->data, PyArray_SIZE(Array)*2, Array->descr->elsize/2); } return Py_BuildValue(""); } static char doc_changetype[] = "changetype(m,typecode). Changes the type, at the C level, of the block of\n" "bytes in the data array of m. It does this in place. This is useful for\n" "manipulating binary data that may have been read in from a file, without\n" "making multiple copies of the data.\n\n" "The array returned is always of rank 1" "changetype(m,typecode) should have the same result as:\n\n" "m = fromstring(m.tostring(),typecode)\n\n" "it only works on contiguous arrays\n"; static PyObject * NumericExtras_changetype(PyObject *self, PyObject *args) { PyArrayObject *Array; //int *new_strides; //int *new_dimensions; PyArray_Descr *NewDescription; int TotalBytes; int NewElementSize; char typecode; if (!PyArg_ParseTuple(args, "O!c", &PyArray_Type, &Array, &typecode)) { return NULL; } if (!PyArray_ISCONTIGUOUS(Array)){ PyErr_SetString (PyExc_ValueError, "m must be contiguous"); return NULL; } TotalBytes = PyArray_NBYTES(Array); // This will crash if an invalid typecode is passed in. NewDescription = PyArray_DescrFromType(typecode); NewElementSize = NewDescription->elsize; if (TotalBytes % NewElementSize > 0){ PyErr_SetString (PyExc_ValueError, "The input array is the wrong size for the requested type"); return NULL; } Array->descr = NewDescription; // does the old descr need to be freed?? how?? Array->nd = 1; Array->strides[0] = NewElementSize; Array->dimensions[0] = TotalBytes / NewElementSize; return Py_BuildValue(""); } static PyMethodDef NumericExtrasMethods[] = { {"fastclip", NumericExtras_fastclip, METH_VARARGS, doc_fastclip}, {"byteswap", NumericExtras_byteswap, METH_VARARGS, doc_byteswap}, {"changetype", NumericExtras_changetype, METH_VARARGS, doc_changetype}, {NULL, NULL} /* Sentinel */ }; void initNumericExtras(void){ (void) Py_InitModule("NumericExtras", NumericExtrasMethods); import_array() }
Christopher Barker wrote:
A. M. Archibald wrote:
Why not write the algorithm in C?
I did just that a while back, for Numeric. I've enclosed the code for reference.
Unfortunately, I never did figure out an efficient way to write this sort of thing for all types, so it only does doubles. Also, it does a bunch of special casing for discontiguous vs. contiguous arrays, and clipping to an array vs a scaler for the min and max arguments.
To do the actual clipping if the datatypes are 'native' is trivial in C: a single loop, a comparison, that's it. I have become an irrational C++ hater with the time, so I don't use template (and I don't think C++ is welcomed in core numpy). For those easy case of template use, autogen works well enough for me; my impression is that numpy uses a similar system, eg for ufunc, etc... You can look at scipy/Lib/sandbox/cdavid/src/levinson1d.tpl for an example of autogen to generate function for any datatype you want; if you need more fancy template facilities like partial specialization and other crazy things mere mortals like me will never understand in C++, then I am not sure autogen can be used. I guess the method used in numpy is better to use for core functionalities, as it avoids the burden of installing one more tool for development. Now, I didn't know that clip was supposed to handle arrays as min/max values. At first, I didn't understand the need to care about contiguous/non contiguous; having non scalar for min/max makes it necessary to have special case for non contiguous. But again, it is important not to lose sight... The goal was to have faster clipping for matplotlib, and this cases are easy, because it is native type and scalar min/max, where contiguous or not does not matter as we traverse the input arrays element by element. If we pass non native endian, non contiguous arrays, there is actually a pretty good chance that the current implementation is already fast enough, and does not need to be changed anyway. Thanks for the suggestion and the precisions, David
David Cournapeau wrote:
To do the actual clipping if the datatypes are 'native' is trivial in C: a single loop, a comparison, that's it.
only if it's also Contiguous. From a quick look at your posted code, it looks like it only works for contiguous arrays.
I don't think C++ is welcomed in core numpy
no , it's not. I avoid C and C++ whenever possible, but there looks to be some real promise in C++ templates, etc -- but it does get ugly! I"ve always wanted to give Blitz++ a try...
autogen works well enough for me;
I didn't know about autogen -- that may be all we need.
Now, I didn't know that clip was supposed to handle arrays as min/max values.
one more nifty feature...And if you want to support broadcasting, even more so!
At first, I didn't understand the need to care about contiguous/non contiguous; having non scalar for min/max makes it necessary to have special case for non contiguous.
I'm confused. This issue is that you can't just increment the pointer to get the next element if the array is non-contiguous.. you need to do all the strides, etc, math.
But again, it is important not to lose sight... The goal was to have faster clipping for matplotlib, and this cases are easy, because it is native type and scalar min/max, where contiguous or not does not matter as we traverse the input arrays element by element.
You still have to traverse them properly, and I didn't find a way to do that equally efficiently for contiguous and non-contiguous arrays -- thus I special cased it. That doesn't mean there isn't a a way -- I really don't know much what I'm doing! What you've [posted doesn't look like it will work for non-contiguous arrays:
static int native_scalar_fbl_clip(npy_float* in, npy_intp size, npy_float min, npy_float max) { //npy_float* kcoeff, npy_float* err) npy_intp i;
for (i = 0; i < size; ++i) { if (in[i] <= min) { in[i] = min; } if (in[i] >= max) { in[i] = max; } } return 0; }
This just loops through the whole C array from start to size -- only works for contiguous data.
If we pass non native endian, non contiguous arrays, there is actually a pretty good chance that the current implementation is already fast enough, and does not need to be changed anyway.
good point -- optimizing the important cases, while allowing all cases is a fine way to go. If anyone needs other cases optimized, that can be added later. I'd still like to know if anyone knows how to efficiently loop through all the elements of a non-contiguous array in C. -Chris -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker@noaa.gov
On 1/11/07, Christopher Barker <Chris.Barker@noaa.gov> wrote: [CHOP]
I'd still like to know if anyone knows how to efficiently loop through all the elements of a non-contiguous array in C.
First, it's not that important if the array is contiguous for this sort of thing. What you really care about is whether it's simply-strided (or maybe single-strided would be a better term). Anyway, the last dimension of the array can be strided without making things more difficult. All you need to be able to do is to address the elements of the array as thedata[offset + stride*index]. That being said, the strategy that the ufuncs use, and possibly other functions as well, is to have the core functions operate only on simply-strided chunks of data. At a higher level, there is some magic that parcels up non-contiguous array into simply-strided chunks and feeds them to core functions. How efficient this is obviously depends on how large the chunks that the magic parceler manages to extract are, but typically it seems to work pretty well. I don't know whether the magic that does the parcelling is available for use by random functions or whether it's specific to the ufunc machinery, I never delved into that end of the ufuncs. -tim
Timothy Hochberg wrote:
On 1/11/07, *Christopher Barker* <Chris.Barker@noaa.gov <mailto:Chris.Barker@noaa.gov>> wrote:
[CHOP]
I'd still like to know if anyone knows how to efficiently loop through all the elements of a non-contiguous array in C.
First, it's not that important if the array is contiguous for this sort of thing. What you really care about is whether it's simply-strided (or maybe single-strided would be a better term). Anyway, the last dimension of the array can be strided without making things more difficult. All you need to be able to do is to address the elements of the array as thedata[offset + stride*index].
I don't understand why we need to do thedata[offset + stride * index] instead of thedata[index] when the data are aligned ? It looks like I seriously misunderstood the meaning of alignement... David
That being said, the strategy that the ufuncs use, and possibly other functions as well, is to have the core functions operate only on simply-strided chunks of data. At a higher level, there is some magic that parcels up non-contiguous array into simply-strided chunks and feeds them to core functions. How efficient this is obviously depends on how large the chunks that the magic parceler manages to extract are, but typically it seems to work pretty well.
I don't know whether the magic that does the parcelling is available for use by random functions or whether it's specific to the ufunc machinery, I never delved into that end of the ufuncs.
-tim
------------------------------------------------------------------------
_______________________________________________ Numpy-discussion mailing list Numpy-discussion@scipy.org http://projects.scipy.org/mailman/listinfo/numpy-discussion
On 1/11/07, David Cournapeau <david@ar.media.kyoto-u.ac.jp> wrote:
Timothy Hochberg wrote:
On 1/11/07, *Christopher Barker* <Chris.Barker@noaa.gov <mailto:Chris.Barker@noaa.gov>> wrote:
[CHOP]
I'd still like to know if anyone knows how to efficiently loop
through
all the elements of a non-contiguous array in C.
First, it's not that important if the array is contiguous for this sort of thing. What you really care about is whether it's simply-strided (or maybe single-strided would be a better term). Anyway, the last dimension of the array can be strided without making things more difficult. All you need to be able to do is to address the elements of the array as thedata[offset + stride*index].
I don't understand why we need to do thedata[offset + stride * index] instead of thedata[index] when the data are aligned ? It looks like I seriously misunderstood the meaning of alignement...
I think you are confusing aligned for contiguous. Aligned normally means the data occurs on natural address boundaries for the architecture and item size, say multiples of 4 for 32 bit words on 32 bit machines. This contrasts with the case where a word is split across natural boundaries: some architectures support that with decreased performance, others don't. Then there are endianess issues, etc, etc. Anyway, the easiest data to deal with is contiguous data where the data items lie one after the other in (virtual) memory. The next easiest is where the spacing between data elements is fixed, an lastly there is the case of generally strided data. Chuck David
That being said, the strategy that the ufuncs use, and possibly other functions as well, is to have the core functions operate only on simply-strided chunks of data. At a higher level, there is some magic that parcels up non-contiguous array into simply-strided chunks and feeds them to core functions. How efficient this is obviously depends on how large the chunks that the magic parceler manages to extract are, but typically it seems to work pretty well.
I don't know whether the magic that does the parcelling is available for use by random functions or whether it's specific to the ufunc machinery, I never delved into that end of the ufuncs.
-tim
------------------------------------------------------------------------
_______________________________________________ Numpy-discussion mailing list Numpy-discussion@scipy.org http://projects.scipy.org/mailman/listinfo/numpy-discussion
_______________________________________________ Numpy-discussion mailing list Numpy-discussion@scipy.org http://projects.scipy.org/mailman/listinfo/numpy-discussion
Charles R Harris wrote:
On 1/11/07, *David Cournapeau* <david@ar.media.kyoto-u.ac.jp <mailto:david@ar.media.kyoto-u.ac.jp>> wrote:
Timothy Hochberg wrote: > > > On 1/11/07, *Christopher Barker* <Chris.Barker@noaa.gov <mailto:Chris.Barker@noaa.gov> > <mailto:Chris.Barker@noaa.gov <mailto:Chris.Barker@noaa.gov>>> wrote: > > [CHOP] > > > I'd still like to know if anyone knows how to efficiently loop through > all the elements of a non-contiguous array in C. > > > First, it's not that important if the array is contiguous for this > sort of thing. What you really care about is whether it's > simply-strided (or maybe single-strided would be a better term). > Anyway, the last dimension of the array can be strided without making > things more difficult. All you need to be able to do is to address the > elements of the array as thedata[offset + stride*index]. I don't understand why we need to do thedata[offset + stride * index] instead of thedata[index] when the data are aligned ? It looks like I seriously misunderstood the meaning of alignement...
I think you are confusing aligned for contiguous. Aligned normally means the data occurs on natural address boundaries for the architecture and item size, say multiples of 4 for 32 bit words on 32 bit machines.
I understand that definition of alignment, but I thought it also implied that the data buffer has an equal spacing between its elements, and this spacing is equal to the size of the type ?
This contrasts with the case where a word is split across natural boundaries: some architectures support that with decreased performance, others don't. Then there are endianess issues, etc, etc. Anyway, the easiest data to deal with is contiguous data where the data items lie one after the other in (virtual) memory. The next easiest is where the spacing between data elements is fixed, So does that mean you can have a numpy array of let's say 8 bytes double, but that each item's address may be 16 bytes one from each other (if we suppose alignment) ?
There are three concepts here: 1: address alignment: to be sure that data pointer is a multiple of the data type (a bit like buffer returned by posix_memalign). 2: data "packing": for an array with elements of size d bytes, for a given element, you get the next one by adding d bytes to the data pointer. 3: style of indexing, eg for a rank 4, what is the function (a, b, c, d) -> i such as array->data[i] = array[a, b, c, d]. So if I understand you correctly, contiguous is about points 2 and 3, and not 3 only as I first thought ? I am confused about the relation between (in bytes) strides, contiguous and alignment... Because when I read the pages 25 to 28 of the numpy ebook about contiguous memory layout, we talk only about items indexing and the relationship between multi-dimensional numpy indexing and one dimension indexing in the actual data buffer (point 3). I thought all numpy arrays were packed... David
On 1/11/07, David Cournapeau <david@ar.media.kyoto-u.ac.jp> wrote:
Charles R Harris wrote:
On 1/11/07, *David Cournapeau* <david@ar.media.kyoto-u.ac.jp <mailto:david@ar.media.kyoto-u.ac.jp>> wrote:
Timothy Hochberg wrote: > > > On 1/11/07, *Christopher Barker* <Chris.Barker@noaa.gov <mailto:Chris.Barker@noaa.gov> > <mailto:Chris.Barker@noaa.gov <mailto:Chris.Barker@noaa.gov>>> wrote: > > [CHOP] > > > I'd still like to know if anyone knows how to efficiently loop through > all the elements of a non-contiguous array in C. > > > First, it's not that important if the array is contiguous for this > sort of thing. What you really care about is whether it's > simply-strided (or maybe single-strided would be a better term). > Anyway, the last dimension of the array can be strided without making > things more difficult. All you need to be able to do is to address the > elements of the array as thedata[offset + stride*index]. I don't understand why we need to do thedata[offset + stride *
index]
instead of thedata[index] when the data are aligned ? It looks like
I
seriously misunderstood the meaning of alignement...
I think you are confusing aligned for contiguous. Aligned normally means the data occurs on natural address boundaries for the architecture and item size, say multiples of 4 for 32 bit words on 32 bit machines.
I understand that definition of alignment, but I thought it also implied that the data buffer has an equal spacing between its elements, and this spacing is equal to the size of the type ?
This contrasts with the case where a word is split across natural boundaries: some architectures support that with decreased performance, others don't. Then there are endianess issues, etc, etc. Anyway, the easiest data to deal with is contiguous data where the data items lie one after the other in (virtual) memory. The next easiest is where the spacing between data elements is fixed, So does that mean you can have a numpy array of let's say 8 bytes double, but that each item's address may be 16 bytes one from each other (if we suppose alignment) ?
Well, the common machines are 32 bit and 64 bit, so for instance extended precision (usually 80 bits) ends up as 96 bits (3*32) on the first and 128 (2*64) bits on the second, with the extra bits ignored. The items in c structures will often have empty spaces filling in between them unless specific compiler directives are invoked and the whole will be aligned on the appropriate boundary. For instance, on my machine struct bar { char a; int b; }; is 8 bytes long because the int forces the whole too be aligned on a 4 byte boundary. I have seen 16 byte alignments for altivec (128 bits) on PPC. So on and so forth. It is all very achitecture dependent. But to answer you guestion, alignment is something that describes the location of a *single* item in memory, usually various numbers, but sometimes structures also. Contiguous describes the spacing between items. There are three concepts here:
1: address alignment: to be sure that data pointer is a multiple of the data type (a bit like buffer returned by posix_memalign). 2: data "packing": for an array with elements of size d bytes, for a given element, you get the next one by adding d bytes to the data pointer. 3: style of indexing, eg for a rank 4, what is the function (a, b, c, d) -> i such as array->data[i] = array[a, b, c, d].
So if I understand you correctly, contiguous is about points 2 and 3, and not 3 only as I first thought ?
2 and 3 are pretty much the same thing in c, but not in numpy. I am confused about the relation
between (in bytes) strides, contiguous and alignment... Because when I read the pages 25 to 28 of the numpy ebook about contiguous memory layout, we talk only about items indexing and the relationship between multi-dimensional numpy indexing and one dimension indexing in the actual data buffer (point 3). I thought all numpy arrays were packed...
The underlying data may be contiguous, but the view is not necessarily so. Transpose, for instance, just changes the view, not the storage, likewise, a[::2] returns a view that skips over every other item. Since x = a[::2] is a valid numpy array sharing memory with a, the items of x are not contiguous. Chuck
Charles R Harris wrote:
Well, the common machines are 32 bit and 64 bit, so for instance extended precision (usually 80 bits) ends up as 96 bits (3*32) on the first and 128 (2*64) bits on the second, with the extra bits ignored. The items in c structures will often have empty spaces filling in between them unless specific compiler directives are invoked and the whole will be aligned on the appropriate boundary. For instance, on my machine
struct bar { char a; int b; };
So the data buffer in numpy is not really a C array, but more like a structure ? (I am talking about the actual binary data) In C, I think you always have sizeof(*a) bytes between a[0] and a[i]. That is sizeof(item) and addresses shift always match: we always have a[i] = (char*)a + sizeof(*a) * i. For example, in your example for extended precision, the computation may be done in 80 bits, but sizeof(long double) is 12 on my machine, not 10. There is a match between sizeof and spacing between items; otherwise, I don't see how pointer arithmetic would be possible with arrays in C otherwise ? But this goes quite further from my usual C knowledge, so I may be wrong here too. cheers, David
I think it may have all been cleared up now, but just in case:
a array([[[ 0, 1, 2, 3], [ 4, 5, 6, 7], [ 8, 9, 10, 11]],
[[12, 13, 14, 15], [16, 17, 18, 19], [20, 21, 22, 23]]])
a.flags C_CONTIGUOUS : True F_CONTIGUOUS : False OWNDATA : True WRITEABLE : True ALIGNED : True UPDATEIFCOPY : False
So here is the most common case: Contiguous, aligned, etc. This means the data buffer is a "normal C array", and you can iterate through it in the normal C way.
b = a[:,2,:] b array([[ 8, 9, 10, 11], [20, 21, 22, 23]])
So b is a slice pulled out of a, sharing the same data black, but not all of it!
b.flags C_CONTIGUOUS : False F_CONTIGUOUS : False OWNDATA : False WRITEABLE : True ALIGNED : True UPDATEIFCOPY : False
So it is not longer a Contiguous array. This means that you are looking at the same block of memory as above, but only some of it is used for this array, and you need to use the strides to get at the elements that are part of this array. This is all aside from alignment issues, of which I know nothing.
So the data buffer in numpy is not really a C array, but more like a structure ? (I am talking about the actual binary data)
Is is a C array of bytes, and probably of the given type, but not all of it is used for this particular array object, or, in the case of Fortran ordering, it's used in a different order that a conventional C n-d array.
"for a numpy array a of eg float32, am I guaranteed that a->data[sizeof(float32) * i] for 0 <= i < a.size gives me all the items of a, even for non contiguous arrays ?"
I think you got the answer to this, but in case you didn't: No, only if it's contiguous (I'm not sure about alignment)
First, it's not that important if the array is contiguous for this
sort of thing. What you really care about is whether it's simply-strided (or maybe single-strided would be a better term)
but how do you get single strided? this is what always made it hard for me to know how to write this kind of code.
Anyway, the last dimension of the array can be strided without making things more difficult. All you need to be able to do is to address the elements of the array as thedata[offset + stride*index].
But how do you make it that simple for n-d arrays? that only works for 1-d. What I have come up with are some macros for the common cases -- 1-d, 2-d, 3-d: #define ARRAYVAL2(aType,a,i,j) ( *(aType *)(a->data + (i)*a->strides[0] + (j)*a->strides[1])) #define ARRAYVAL3(aType,a,i,j,k) ( *(aType *)(a->data + (i)*a->strides[0] + (j)*a->strides[1] + (k)*a->strides[2])) (this was for Numeric -- maybe it's slightly different no) This involves a fair bit of math for each index operation -- it could really slow down a simple function like clip. To do this for the general case, the only thing I could come up with was recursion, which seemed a bit heavy-weight, so I went with looping through all the indices to compute the offset. Ugly and slow.
Now, it is exposed in the concept of an array iterator. Anybody can take advantage of it as it there is a C-API call to get an array iterator from the array
Is this iterator as efficient as incrementing the index for contiguous arrays? i.e. is there any point of special casing contiguous arrays if you use it? -Chris -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker@noaa.gov
Christopher Barker wrote:
Now, it is exposed in the concept of an array iterator. Anybody can take advantage of it as it there is a C-API call to get an array iterator from the array
Is this iterator as efficient as incrementing the index for contiguous arrays? i.e. is there any point of special casing contiguous arrays if you use it?
Contiguous arrays are special-cased already (as are 1-d arrays), so not really. There is a little-bit of overhead in testing for the special-case that you can overcome by testing for the special-case outside the looping structure on your own. I'm not sure how much speed gain is accomplished doing that. -Travis
On 1/13/07, Christopher Barker <Chris.Barker@noaa.gov> wrote:
I think it may have all been cleared up now, but just in case:
a array([[[ 0, 1, 2, 3], [ 4, 5, 6, 7], [ 8, 9, 10, 11]],
[[12, 13, 14, 15], [16, 17, 18, 19], [20, 21, 22, 23]]])
a.flags C_CONTIGUOUS : True F_CONTIGUOUS : False OWNDATA : True WRITEABLE : True ALIGNED : True UPDATEIFCOPY : False
So here is the most common case: Contiguous, aligned, etc. This means the data buffer is a "normal C array", and you can iterate through it in the normal C way.
b = a[:,2,:] b array([[ 8, 9, 10, 11], [20, 21, 22, 23]])
So b is a slice pulled out of a, sharing the same data black, but not all of it!
b.flags C_CONTIGUOUS : False F_CONTIGUOUS : False OWNDATA : False WRITEABLE : True ALIGNED : True UPDATEIFCOPY : False
So it is not longer a Contiguous array. This means that you are looking at the same block of memory as above, but only some of it is used for this array, and you need to use the strides to get at the elements that are part of this array.
This is all aside from alignment issues, of which I know nothing.
So the data buffer in numpy is not really a C array, but more like a structure ? (I am talking about the actual binary data)
Is is a C array of bytes, and probably of the given type, but not all of it is used for this particular array object, or, in the case of Fortran ordering, it's used in a different order that a conventional C n-d array.
"for a numpy array a of eg float32, am I guaranteed that a->data[sizeof(float32) * i] for 0 <= i < a.size gives me all the items of a, even for non contiguous arrays ?"
I think you got the answer to this, but in case you didn't:
No, only if it's contiguous (I'm not sure about alignment) According to numpy ebook, you need alignement for deferencing
I finally understood that point, which prevents me from understanding the real issue. I totally forgot about views which are not copies in numpy (contrary to matlab, where my experience in C extension lies), which makes for all those case... I kepts thinking how come you could create a non aligned or a non packed array in numpy, but once you start thinking about views and everything, it all makes sense... So basically, in my case, I do not care about F_CONTIGUOUS vs C_CONTIGUOUS, but I care about contiguity. So my special case is really F_CONTIGUOUS or C_CONTIGUOUS. pointers. But I guess non aligned arrays are not really common.
First, it's not that important if the array is contiguous for this
sort of thing. What you really care about is whether it's simply-strided (or maybe single-strided would be a better term)
but how do you get single strided? this is what always made it hard for me to know how to write this kind of code.
I am also interested in that. You could imagine that some arrays are neither C or F contiguous, but still are "packed" (using some concatenatation, for example ?), and as such can be iterated easily.
This involves a fair bit of math for each index operation -- it could really slow down a simple function like clip.
Indeed. But again, slower operations on those arrays is to be expected, and speeding them up is not all that important. At least, it is not important for the case I want to speed up.
Is this iterator as efficient as incrementing the index for contiguous arrays? i.e. is there any point of special casing contiguous arrays if you use it?
I think the problem is not that much incrementing the iterator, but more having contiguous memory access. For example, the iterator may not guarantee traversing the array "contiguously" even if the array is contiguous (I didn't look at the sources for the array iterator). If for example you access a C order array but iterating the F way, it will be much slower, because your computation becomes limited by main memory's speed. For some code related to linear prediction coding, I tried to take into account C_CONTIGUOUS vs other cases, and I realised that in most cases, it is not only easier but often faster to copy the data in a new contiguous data buffer and doing the computation on that. Basically, at least for those cases, it didn't worth the effort to handle complicated cases (the memory buffer being not that big). David
On 1/12/07, David Cournapeau <cournape@gmail.com> wrote:
On 1/13/07, Christopher Barker <Chris.Barker@noaa.gov> wrote:
I think it may have all been cleared up now, but just in case:
but how do you get single strided? this is what always made it hard for me to know how to write this kind of code.
<snip> Make a copy of the array. New copies are C_CONTIGUOUS by default. There is also the ascontiguousarray function: ascontiguousarray(a, dtype=None) Return 'a' as an array contiguous in memory (C order). Which makes a copy only when required. Since clip returns a new array anyway, this shouldn't be a problem except when clipping against another array, which you would also want to be contiguous. In practice, I don't think using ascontiguousarray for the clipping array would add much overhead as the array would likely be contiguous in the first place. You would probably want to match data types also. I think the needed ops are already implemented for the types at the c-level, __gt__ for instance, and in that case you can simply use the function pointer with some loss of speed. If you can determine the corresponding c-type, and I think you can, then it shouldn't be too much trouble to implement type specific clipping, but it might not be worth the effort. There are lots of code snippets in other functions similar to what you need that could be a good starting point. You just need to find them 8^) Chuck
On 1/13/07, Charles R Harris <charlesr.harris@gmail.com> wrote:
On 1/12/07, David Cournapeau <cournape@gmail.com> wrote:
On 1/13/07, Christopher Barker <Chris.Barker@noaa.gov> wrote:
I think it may have all been cleared up now, but just in case:
but how do you get single strided? this is what always made it hard for me to know how to write this kind of code.
<snip>
Make a copy of the array. New copies are C_CONTIGUOUS by default. There is also the ascontiguousarray function:
ascontiguousarray(a, dtype=None) Return 'a' as an array contiguous in memory (C order).
Which makes a copy only when required. Since clip returns a new array anyway,
This is not always true: clip as a array member functions accepts out as a keyword, even is out is the same than input. Something like a.clip(min, max, a) does work. I actually do not understand this out argument. Doing a.clip(min, max, a) for in place is a bit strange, isn't it ? Part of the problem is that I get the impression that clip has some behaviour which really depends on the current implementation (using addition and so on). For example, a.clip(0., 1.) does return a float32 array when a is a float32, and fails if a is an integer. There is no upcast, and I don't see any reason why; I thought that numpy was supposed to (try to ) upcast any necessary inputs when necessary ? Doing an equivalent of putmask(a, a<m. m) and putmask(a, a>M, M) is easier, but is not compatible with the current clip for many cases. this shouldn't be a problem except when clipping against another
array, which you would also want to be contiguous. In practice, I don't think using ascontiguousarray for the clipping array would add much overhead as the array would likely be contiguous in the first place. You would probably want to match data types also. I think the needed ops are already implemented for the types at the c-level, __gt__ for instance, and in that case you can simply use the function pointer with some loss of speed. If you can determine the corresponding c-type, and I think you can, then it shouldn't be too much trouble to implement type specific clipping, but it might not be worth the effort.
Having type specific clip operation is really trivial compared to all other problems: a few lines of templated code, and generating them for all basic C types. The problem really is to know when you can use those functions, and the rules for conversion while keeping the same semantics than the original clip function which are quirky in some cases, David
El ds 13 de 01 del 2007 a les 14:42 +0900, en/na David Cournapeau va escriure:
No, only if it's contiguous (I'm not sure about alignment) According to numpy ebook, you need alignement for deferencing pointers. But I guess non aligned arrays are not really common.
Unaligned arrays can be quite common if you use recarrays often: In [18]:a=numpy.empty(3, dtype="S1,i4,f8") In [19]:a.dtype.fields.items() Out[19]: [('f0', (dtype('|S1'), 0)), ('f1', (dtype('int32'), 1)), ('f2', (dtype('float64'), 5))] In [20]:f1=a['f1'] In [21]:f1.flags Out[21]: C_CONTIGUOUS : False F_CONTIGUOUS : False OWNDATA : False WRITEABLE : True ALIGNED : False UPDATEIFCOPY : False And dereferencing unaligned pointers do cause a crash in general, but not in all architectures (i386 being one of them). Cheers, -- Francesc Altet | Be careful about using the following code -- Carabos Coop. V. | I've only proven that it works, www.carabos.com | I haven't tested it. -- Donald Knuth
On 1/13/07, Francesc Altet <faltet@carabos.com> wrote:
El ds 13 de 01 del 2007 a les 14:42 +0900, en/na David Cournapeau va escriure:
No, only if it's contiguous (I'm not sure about alignment) According to numpy ebook, you need alignement for deferencing pointers. But I guess non aligned arrays are not really common.
Unaligned arrays can be quite common if you use recarrays often: Well, this gives me one more unit test :) I didn't know any way of creating non aligned arrays... Thank you.
cheers, David
Timothy Hochberg wrote:
On 1/11/07, *Christopher Barker* <Chris.Barker@noaa.gov <mailto:Chris.Barker@noaa.gov>> wrote:
[CHOP]
I'd still like to know if anyone knows how to efficiently loop through all the elements of a non-contiguous array in C.
First, it's not that important if the array is contiguous for this sort of thing. What you really care about is whether it's simply-strided (or maybe single-strided would be a better term). Anyway, the last dimension of the array can be strided without making things more difficult. All you need to be able to do is to address the elements of the array as thedata[offset + stride*index].
This is right. Just remember that strides are "byte-strides" and not "element-strides" and so thedata had better be a pointer to a byte.
That being said, the strategy that the ufuncs use, and possibly other functions as well, is to have the core functions operate only on simply-strided chunks of data. At a higher level, there is some magic that parcels up non-contiguous array into simply-strided chunks and feeds them to core functions. How efficient this is obviously depends on how large the chunks that the magic parceler manages to extract are, but typically it seems to work pretty well.
This is one thing I've exposed (and made use of in more than one place) with NumPy. In Numeric, the magic was in a few lines of the ufuncobject file). Now, it is exposed in the concept of an array iterator. Anybody can take advantage of it as it there is a C-API call to get an array iterator from the array (it's actually the object returned by the .flat method). You can even get an iterator that iterates over one-less dimension than the array has (with the dimension using the smallest strides left "un-iterated" so that you can call an inner loop with it). You can see examples of this in several places. The basic template is: int axis=-1; /* -1 means let the code decide which axis, otherwise you can choose */ dit = (PyArrayIterObject *)PyArray_IterAllButAxis(dest, &axis); if (dit==NULL) /* error return*/ while (dit->index < dit->size) { /* do something dit->dataptr points to the first position of the first "line" PyArray_STRIDE(dest, axis) is the striding PyArray_DIM(dest, axis) is the size of the "line" */ PyArray_ITER_NEXT(dit); /* move to the next line */ } You see code like this all over in NumPy. The broadcasting of ufuncs is also exposed (so you can do this on multiple arrays). See the multi-iterator objects. -Travis -Travis
Travis Oliphant wrote:
This is one thing I've exposed (and made use of in more than one place) with NumPy. In Numeric, the magic was in a few lines of the ufuncobject file). Now, it is exposed in the concept of an array iterator. Anybody can take advantage of it as it there is a C-API call to get an array iterator from the array (it's actually the object returned by the .flat method). You can even get an iterator that iterates over one-less dimension than the array has (with the dimension using the smallest strides left "un-iterated" so that you can call an inner loop with it).
The thing which confuses me is whether this is useful when you only one item of one array at a time. When I was implementing some functions for LPC, I took a look at your examples for array iterators and explanations in the numpy ebook, and it looked really helpful, indeed. For this kind of code, I needed to operate on several contiguous elements at a time. But here, for cliping with scalar min and max, I only need to access to one item at a time from the input array, and that's it; in particular, I don't care about the order of iteration. So the question really boils down to: "for a numpy array a of eg float32, am I guaranteed that a->data[sizeof(float32) * i] for 0 <= i < a.size gives me all the items of a, even for non contiguous arrays ?" cheers, David
On 1/11/07, David Cournapeau <david@ar.media.kyoto-u.ac.jp> wrote:
Travis Oliphant wrote:
This is one thing I've exposed (and made use of in more than one place) with NumPy. In Numeric, the magic was in a few lines of the ufuncobject file). Now, it is exposed in the concept of an array iterator. Anybody can take advantage of it as it there is a C-API call to get an array iterator from the array (it's actually the object returned by the .flat method). You can even get an iterator that iterates over one-less dimension than the array has (with the dimension using the smallest strides left "un-iterated" so that you can call an inner loop with it).
The thing which confuses me is whether this is useful when you only one item of one array at a time. When I was implementing some functions for LPC, I took a look at your examples for array iterators and explanations in the numpy ebook, and it looked really helpful, indeed. For this kind of code, I needed to operate on several contiguous elements at a time.
But here, for cliping with scalar min and max, I only need to access to one item at a time from the input array, and that's it; in particular, I don't care about the order of iteration. So the question really boils down to:
"for a numpy array a of eg float32, am I guaranteed that a->data[sizeof(float32) * i] for 0 <= i < a.size gives me all the items of a, even for non contiguous arrays ?"
No. That is what the array iterator is for. Chuck
On 1/12/07, Charles R Harris <charlesr.harris@gmail.com> wrote:
On 1/11/07, David Cournapeau <david@ar.media.kyoto-u.ac.jp> wrote:
Travis Oliphant wrote:
This is one thing I've exposed (and made use of in more than one
place)
with NumPy. In Numeric, the magic was in a few lines of the ufuncobject file). Now, it is exposed in the concept of an array iterator. Anybody can take advantage of it as it there is a C-API call to get an array iterator from the array (it's actually the object returned by the .flat method). You can even get an iterator that iterates over one-less dimension than the array has (with the dimension using the smallest strides left "un-iterated" so that you can call an inner loop with it). The thing which confuses me is whether this is useful when you only one item of one array at a time. When I was implementing some functions for LPC, I took a look at your examples for array iterators and explanations in the numpy ebook, and it looked really helpful, indeed. For this kind of code, I needed to operate on several contiguous elements at a time.
But here, for cliping with scalar min and max, I only need to access to one item at a time from the input array, and that's it; in particular, I
don't care about the order of iteration. So the question really boils down to:
"for a numpy array a of eg float32, am I guaranteed that a->data[sizeof(float32) * i] for 0 <= i < a.size gives me all the items of a, even for non contiguous arrays ?"
No. That is what the array iterator is for.
Although it is pretty common to make a copy of the array that *is* contiguous and pass that down. Doing so keeps life simple for the programmer and is pretty much required when interfacing to third party c and fortran routines. Chuck
Christopher Barker wrote:
autogen works well enough for me;
I didn't know about autogen -- that may be all we need.
numpy has code which already does something similar to autogen: you declare a function, and some template with a generic name, and the code generator replaces the generic name and type with some values. All the .src files in numpy/core follow this pattern.
Now, I didn't know that clip was supposed to handle arrays as min/max values.
one more nifty feature...And if you want to support broadcasting, even more so!
At first, I didn't understand the need to care about contiguous/non contiguous; having non scalar for min/max makes it necessary to have special case for non contiguous.
I'm confused. This issue is that you can't just increment the pointer to get the next element if the array is non-contiguous.. you need to do all the strides, etc, math.
Ok, so we don't mean the same thing by contiguous, and I should check that my definition is the actual one... For me, contiguous means that the array has C order, and a non contiguous array has a 'random' order, but still can go to the next element in the buffer by using standard C array addressing. In my mind, contiguous is about the relationship between the indexing of the array in C and the math indexing. According to the numpy ebook, the data buffer may: - not be aligned on word boundaries -> NPY_ALIGNED - not be native endianess -> NPY_ISNOTSWAPPED - not C contiguous (last index does not move first) -> NPY_CONTIGUOUS. I thought that as long as NPY_ALIGNED is true, you are sure that array->data[i] is the ith element of the buffer with the datatype of the array ? If the data are not aligned or not native endian, I just use the existing implementation; if you are not using the CPU endianness or alignment, you cannot expect to do things at a decent speed anyway. In my code, I differentiate alignment, endianness and scalar case. If any of this condition is not true, I just rely on the old implementation for now, which should make it easy to extend if necessary. cheers, David
Stefan van der Walt wrote:
On Wed, Jan 10, 2007 at 08:28:14PM +0100, Francesc Altet wrote:
El dt 09 de 01 del 2007 a les 23:19 +0900, en/na David Cournapeau va escriure:
Hi,
I am finally implementing a C function to replace the current slow implementation of clip in python as promised a few weeks ago. The idea is to implement it as the following:
def clip(input, min, max): a = input.copy() putmask(a, a <= min, min) putmask(a, a >= max, max) return a
I don't have that much experience in writing general numpy functions, so I was wondering of other people could advise me on the following points.
Sorry, but not real experience writing extensions directly in C. However, you may want to experiment using numexpr for doing what you want. It's relatively easy to extend numexpr and adding a new opcode to its virtual machine. I'm attaching a patch for implementing such a clip routine (only for floating point types, but given the example, it should be straightforward to add support for integers as well).
Also, you should note that using the fancy indexing of numpy seems to perform better than the putmask approach. Below are my figures for a small benchmark (also attached) for testing the performance of clip using several approaches:
time (putmask)--> 1.38 time (where)--> 2.713 time (numexpr where)--> 1.291 time (fancy+assign)--> 0.967 time (numexpr clip)--> 0.596
It is interesting to see there how fancy-indexing + assignation is quite more efficient than putmask.
Not on my machine:
time (putmask)--> 0.181 time (where)--> 0.783 time (numexpr where)--> 0.26 time (fancy+assign)--> 0.202 When I started looking at those things, I did the indexing method, and someone else proposed putmask, function which I was not aware of at that time. Both are similar in speed, and vastly (almost an order of magnitude on moderately sized contiguous arrays) faster than the current numpy clip.
My current C implementation does not use the equivalent of putmask. I try to determine which case are easy (basically, if the datatype is a numeric datatype and native endian), handle those directly, and for other cases (non native endian, objects, etc...), simply forwarding to the original function for now. The main difficulty is that I am not aware of all the datatypes that numpy functions are supposed to handle (for example, when I started, I didn't know that numpy could handle non native endian, which makes things a bit more complicated in C to support). cheers, David
A Dimecres 10 Gener 2007 22:49, Stefan van der Walt escrigué:
On Wed, Jan 10, 2007 at 08:28:14PM +0100, Francesc Altet wrote:
El dt 09 de 01 del 2007 a les 23:19 +0900, en/na David Cournapeau va
escriure: time (putmask)--> 1.38 time (where)--> 2.713 time (numexpr where)--> 1.291 time (fancy+assign)--> 0.967 time (numexpr clip)--> 0.596
It is interesting to see there how fancy-indexing + assignation is quite more efficient than putmask.
Not on my machine:
time (putmask)--> 0.181 time (where)--> 0.783 time (numexpr where)--> 0.26 time (fancy+assign)--> 0.202
Yeah, a lot of difference indeed. Just for reference, my results above were done using a Duron (an Athlon but with only 128 KB of secondary cache) at 0.9 GHz. Now, using my laptop (Intel 4 @ 2 GHz, 512 KB of secondary cache), I get: time (putmask)--> 0.244 time (where)--> 2.111 time (numexpr where)--> 0.427 time (fancy+assign)--> 0.316 time (numexpr clip)--> 0.184 so, on my laptop fancy+assign is way slower than putmask. It should be noted also that the implementation of clip in numexpr (i.e. in pure C) is not that much faster than putmask (just a 30%); so perhaps it is not so necessary to come up with a pure C implementation for clip (or at least, on Intel P4 machines!). In any case, it is really shocking seeing how differently can perform the several CPU architectures on this apparently simple problem. BTW, I'm attaching a slightly enhanced version of the clip patch for numexpr that I used for the new benchmark showed here. Cheers, --
0,0< Francesc Altet http://www.carabos.com/ V V Cárabos Coop. V. Enjoy Data "-"
Francesc Altet wrote:
A Dimecres 10 Gener 2007 22:49, Stefan van der Walt escrigué:
On Wed, Jan 10, 2007 at 08:28:14PM +0100, Francesc Altet wrote:
El dt 09 de 01 del 2007 a les 23:19 +0900, en/na David Cournapeau va
escriure: time (putmask)--> 1.38 time (where)--> 2.713 time (numexpr where)--> 1.291 time (fancy+assign)--> 0.967 time (numexpr clip)--> 0.596
It is interesting to see there how fancy-indexing + assignation is quite more efficient than putmask. Not on my machine:
time (putmask)--> 0.181 time (where)--> 0.783 time (numexpr where)--> 0.26 time (fancy+assign)--> 0.202
Yeah, a lot of difference indeed. Just for reference, my results above were done using a Duron (an Athlon but with only 128 KB of secondary cache) at 0.9 GHz. Now, using my laptop (Intel 4 @ 2 GHz, 512 KB of secondary cache), I get:
time (putmask)--> 0.244 time (where)--> 2.111 time (numexpr where)--> 0.427 time (fancy+assign)--> 0.316 time (numexpr clip)--> 0.184
so, on my laptop fancy+assign is way slower than putmask. It should be noted also that the implementation of clip in numexpr (i.e. in pure C) is not that much faster than putmask (just a 30%); so perhaps it is not so necessary to come up with a pure C implementation for clip (or at least, on Intel P4 machines!).
In any case, it is really shocking seeing how differently can perform the several CPU architectures on this apparently simple problem. I am not sure it is such a simple problem: it involves massive branching.
I have never taken a look a numexpr, but the idea seems really interesting, I will take a look at it when I will have some time. Anyway, I've just finished and tested a pure C implementation of the clip function. As it is, it should be able to replace PyArray_Clip calls by PyArray_FastClip in numpy/core/multiarray.c. The idea is that for 'easy' cases, it uses a trivial but fast implementation; for all other cases, it uses the old implementation for now. By easy cases, I mean scalar min and max, for non-complex number with native endianness (from npy_bool to npy_longdouble), which should cover most usages. There are still some things I am unsure: - the original clip is supposed to work with complex numbers, but I am not sure about the semantics in this case. - If you have a float32 input, but float64 min/max values, the original clip does not upcast the input. If you have integer input but floating point min/max, the original clip fails. Is this the wanted behaviour ? My implementation upcasts whenever possible; but then, I am not sure how to handle the cases where no copy is asked (which I am not handling myself for now). As for now, when PyArray_FastClip uses a fast implementation, it is roughly 5x faster for float32 and float64 input. I expect a double speed once the no copy option is implemented (again, for easy cases). I attached blop.c which implements the fast clip in a module, the clip_imp.c which implements the clipping for all native types (it is generated by autogen because I wanted to avoid depending on numpy.distutils for development), a Makefile and a test file which also profile the clip function with float32 inputs. Does it look Ok to other so that it can be commited to numpy (once the two above problems are solved, of course, to keep the same behaviour than PyArray_Clip) ? cheers, David CC = colorgcc LD = gcc PYTHONINC = -I/usr/include/python2.4 NUMPYINC = -I/home/david/local/lib/python2.4/site-packages/numpy/core/include OPTIMS = -O2 -funroll-all-loops #OPTIMS = -g -DDEBUG WARN = -W -Wall -Wstrict-prototypes -Winline -Wstrict-prototypes \ -Wmissing-prototypes -Wcast-align \ -Wcast-qual -Wnested-externs -Wshadow -Wbad-function-cast #-Wwrite-strings CFLAGS = $(PYTHONINC) $(NUMPYINC) $(OPTIMS) $(WARN) AUTOGEN = autogen _blop.so: blop.o $(LD) -o $@ -shared -Wl,-soname,$@ blop.o blop.o: blop.c clip_imp.c $(CC) -c $(CFLAGS) -fPIC $< #========================= # Autogenerated c sources #========================= clip_imp.c: clip_imp.def clip_imp.tpl $(AUTOGEN) clip_imp.def clean: rm -f *.o rm -f *.so rm -f clip_imp.c
Francesc Altet wrote:
A Dimecres 10 Gener 2007 22:49, Stefan van der Walt escrigué:
On Wed, Jan 10, 2007 at 08:28:14PM +0100, Francesc Altet wrote:
El dt 09 de 01 del 2007 a les 23:19 +0900, en/na David Cournapeau va
escriure: time (putmask)--> 1.38 time (where)--> 2.713 time (numexpr where)--> 1.291 time (fancy+assign)--> 0.967 time (numexpr clip)--> 0.596
It is interesting to see there how fancy-indexing + assignation is quite more efficient than putmask. Not on my machine:
time (putmask)--> 0.181 time (where)--> 0.783 time (numexpr where)--> 0.26 time (fancy+assign)--> 0.202
Yeah, a lot of difference indeed. Just for reference, my results above were done using a Duron (an Athlon but with only 128 KB of secondary cache) at 0.9 GHz. Now, using my laptop (Intel 4 @ 2 GHz, 512 KB of secondary cache), I get:
time (putmask)--> 0.244 time (where)--> 2.111 time (numexpr where)--> 0.427 time (fancy+assign)--> 0.316 time (numexpr clip)--> 0.184
so, on my laptop fancy+assign is way slower than putmask. It should be noted also that the implementation of clip in numexpr (i.e. in pure C) is not that much faster than putmask (just a 30%); so perhaps it is not so necessary to come up with a pure C implementation for clip (or at least, on Intel P4 machines!).
In any case, it is really shocking seeing how differently can perform the several CPU architectures on this apparently simple problem. I am not sure it is such a simple problem: it involves massive branching. To be more precise, you can do clipping without branching, but then the clipping is highly type and machine dependent (using bit mask and other
David Cournapeau wrote: tricks). It may worth the trouble for double, float and int, dunno. David
El dv 12 de 01 del 2007 a les 00:58 +0900, en/na David Cournapeau va escriure:
Francesc Altet wrote:
A Dimecres 10 Gener 2007 22:49, Stefan van der Walt escrigué:
On Wed, Jan 10, 2007 at 08:28:14PM +0100, Francesc Altet wrote:
El dt 09 de 01 del 2007 a les 23:19 +0900, en/na David Cournapeau va
escriure: time (putmask)--> 1.38 time (where)--> 2.713 time (numexpr where)--> 1.291 time (fancy+assign)--> 0.967 time (numexpr clip)--> 0.596
It is interesting to see there how fancy-indexing + assignation is quite more efficient than putmask. Not on my machine:
time (putmask)--> 0.181 time (where)--> 0.783 time (numexpr where)--> 0.26 time (fancy+assign)--> 0.202
Yeah, a lot of difference indeed. Just for reference, my results above were done using a Duron (an Athlon but with only 128 KB of secondary cache) at 0.9 GHz. Now, using my laptop (Intel 4 @ 2 GHz, 512 KB of secondary cache), I get:
time (putmask)--> 0.244 time (where)--> 2.111 time (numexpr where)--> 0.427 time (fancy+assign)--> 0.316 time (numexpr clip)--> 0.184
so, on my laptop fancy+assign is way slower than putmask. It should be noted also that the implementation of clip in numexpr (i.e. in pure C) is not that much faster than putmask (just a 30%); so perhaps it is not so necessary to come up with a pure C implementation for clip (or at least, on Intel P4 machines!).
In any case, it is really shocking seeing how differently can perform the several CPU architectures on this apparently simple problem. I am not sure it is such a simple problem: it involves massive branching. To be more precise, you can do clipping without branching, but then the clipping is highly type and machine dependent (using bit mask and other
David Cournapeau wrote: tricks). It may worth the trouble for double, float and int, dunno.
Well, I don't know which this trick can be, but just for completeness I have run the benchmark on an AMD Opteron (2 GHZ, 1 MB of secondary cache) machine, and here are the timings: time (putmask)--> 0.5 time (where)--> 1.035 time (numexpr where)--> 0.263 time (fancy+assign)--> 0.311 time (numpy clip)--> 0.704 time (numexpr clip)--> 0.089 [Note that I've added the clip from numpy. See the new benchmark attached] Curiously enough, fancy+assign is again faster than putmask. Also interesting is the fact that AMD seems to have optimized the branching in Opterons quite a lot: the processor is only 2x faster in GHz than the Duron, and most of the benchs do run 3x or 4x faster, but the numexpr clip does perform more than 6x faster (!). -- Francesc Altet | Be careful about using the following code -- Carabos Coop. V. | I've only proven that it works, www.carabos.com | I haven't tested it. -- Donald Knuth
participants (9)
-
A. M. Archibald -
Charles R Harris -
Christopher Barker -
David Cournapeau -
David Cournapeau -
Francesc Altet -
Stefan van der Walt -
Timothy Hochberg -
Travis Oliphant