No Copy Reduce Operations


Luis Pedro Coelho wrote:
Hello all,
Numpy arrays come with several reduce operations: sum(), std(), argmin(), min(), ....
The traditional implementation of these suffers from two big problems: It is slow and it often allocates intermediate memory. I have code that is failing with OOM (out of memory) exceptions in calls to ndarray.std(). I regularly handle arrays with 100 million entries (have a couple of million objects * 20 features per object = 100 million doubles), so this is a real problem for me.
Luis,

Thank you for your work and your enthusiasm. You are absolutely right that the default implementations are generic and therefore potentially slower and more memory-consuming. There is generally a basic tradeoff between generic code and fast, memory-conserving code.

The default implementations of std and var are quite different from sum, min, argmin, etc. The main difference is that the latter are direct reduce methods on the ufuncs, while the former are generic extensions written as "Python expressed with the Python C-API."

Your approach using C++ templates is interesting, and I'm very glad for your explanation and for your releasing the code as open source. I'm not prepared to start using C++ in NumPy, however, so your code will have to serve as an example only.

One way to do this without using templates is to extend the dtype functions array with additional function pointers (std and var). This has been done several times in the past and is probably advisable. In that case your code could very likely be used (at least the C-compatible portions). I'm grateful you are willing to re-license that part of your code as BSD so that it can possibly be used in NumPy.

Thanks so much. It is exciting to see interest in the code, especially from students.

Best regards,
-Travis

P.S. Specific answers to some of your questions below.
I am not correctly implementing the dtype parameter. I thought it controlled the type of the intermediate results, but if I do
The dtype parameter controls the type of the "reduction" but not the final result (which is always a float for the mean because of the division).
I see three possible paths for this: (1) You agree that this is nice, it achieves good results and I should strive to integrate this into numpy proper, replacing the current implementation. I would start a branch in numpy svn and implement it there and finally merge into the main branch. I would be perfectly happy to relicense to BSD if this is the case.
The idea of your code is great, but the C++ implementation cannot be directly used.
One could think of the array-to-array (A+B, A+=2, A != B,...) operations using a similar scheme This would imply that part of the C API would be obsoleted: the whole functions array would stop making sense. I don't know how widespread it's used.
I don't know what you mean by the "functions array". If you are talking about the ufuncs, then yes, their use is widespread.

Luis Pedro Coelho wrote:
Hello all,
Numpy arrays come with several reduce operations: sum(), std(), argmin(), min(), ....
The traditional implementation of these suffers from two big problems: It is slow and it often allocates intermediate memory. I have code that is failing with OOM (out of memory) exceptions in calls to ndarray.std(). I regularly handle arrays with 100 million entries (have a couple of million objects * 20 features per object = 100 million doubles), so this is a real problem for me.
This being open source, I decided to solve the problem. My first idea was to try to improve the numpy code, but I failed to see how to do that while supporting everything that numpy does (multiple types, for example). So I started an implementation of the reduce operations that uses C++ templates to generate code optimised for the types it actually encounters, choosing the right version to call at run time. In the spirit of release-early/release-often, I attach the first version of this code that runs.
BASIC IDEA ===============
ndarray.std basically does the following (the examples are in pseudo-code, even though the actual implementation happens to be in C):
def stddev(A):
    mu = A.mean()
    diff = (A - mu)
    maybe_conj = (diff if not complex(A) else diff.conjugate())
    diff *= maybe_conj
    return sqrt(diff.mean())
With a lot of temporary arrays. My code does:
def stddev(A):
    mu = A.mean()  # No getting around this temporary
    std = 0
    for i in xrange(A.size):
        diff = (A[i] - mu)
        if complex(A):
            diff *= conjugate(diff)
        else:
            diff *= diff
        std += diff
    return sqrt(std / A.size)
Of course, my code does it while taking into account the geometry of the array. It handles arrays with arbitrary strides.
I do this without ever copying the array (most of the existing code will transpose/copy the array so that it is well behaved in memory).
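To make the contrast concrete, here is a minimal, self-contained C++ sketch (illustrative only, not taken from the attached code) of the kind of kernel that ends up running for a contiguous double array once the mean is known:

#include <cmath>
#include <cstddef>

// One-pass standard deviation over a contiguous buffer of doubles.
// Assumes the mean has already been computed (the one unavoidable pass).
double stddev_contiguous(const double* data, std::size_t n, double mu)
{
    double acc = 0.0;
    for (std::size_t i = 0; i != n; ++i) {
        const double diff = data[i] - mu;
        acc += diff * diff;          // no temporary arrays are created
    }
    return std::sqrt(acc / n);
}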
IMPLEMENTATION ===============
I have written some template infrastructure so that, if you wanted a very fast entropy calculation on a normalised array, you could write:
template <typename BaseIterator, typename ResultsIterator>
void compute_entropy(BaseIterator data, BaseIterator past, ResultsIterator results) {
    while (data != past) {
        if (*data) *results += *data * std::log(*data);
        ++data;
        ++results;
    }
}
and the machinery will instantiate this in several variations, deciding at run-time which one to call. You just have to write a C interface function like
PyObject* fast_entropy(PyArrayObject *self, PyObject *args, PyObject *kwds)
{
    int axis = NPY_MAXDIMS;
    PyArray_Descr *dtype = NULL;
    PyArrayObject *out = NULL;
    static char *kwlist[] = {"array", "axis", "dtype", "out", NULL};

    if (!PyArg_ParseTupleAndKeywords(args, kwds, "O|O&O&O&", kwlist,
            &self,
            PyArray_AxisConverter, &axis,
            PyArray_DescrConverter2, &dtype,
            PyArray_OutputConverter, &out)) {
        Py_XDECREF(dtype);
        return NULL;
    }
    int num = _get_type_num_double(self->descr, dtype);
    Py_XDECREF(dtype);
    return compress_dispatch<EntropyCompute>(self, out, axis, num, EmptyType()); // This decides which function to call
}
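The compress_dispatch call above is what picks the instantiation; its real interface handles axis, out, dtype and many element types. Purely as an illustration of the run-time-dispatch idea (hypothetical helper names, contiguous aligned data assumed, only two types shown), it boils down to something like:

// Sketch only -- assumes <Python.h>, <numpy/arrayobject.h> and <cmath> are included,
// and that the array is contiguous and aligned.
template <typename T>
static double entropy_kernel(const T* data, npy_intp n)
{
    double h = 0.0;
    for (npy_intp i = 0; i != n; ++i)
        if (data[i]) h += double(data[i]) * std::log(double(data[i]));
    return h;
}

static double entropy_dispatch(PyArrayObject* a)
{
    void* data = PyArray_DATA(a);
    npy_intp n = PyArray_SIZE(a);
    switch (PyArray_TYPE(a)) {             // element type is only known at run time
    case NPY_FLOAT:
        return entropy_kernel(static_cast<const float*>(data), n);
    case NPY_DOUBLE:
        return entropy_kernel(static_cast<const double*>(data), n);
    default:
        return 0.0;                        // the real code raises a TypeError instead
    }
}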
For contiguous arrays with axis=None, this becomes
void compute_entropy(Base* data, Base* past, Results* result) {
    while (data != past) {
        if (*data) *result += *data * std::log(*data);
        ++data;
    }
}
which is probably as fast as it can be.
If the array is not contiguous, this becomes
void compute_entropy(numpy_iterator<Base> data, numpy_iterator<Base> past, Results* result) {
    while (data != past) {
        if (*data) *result += *data * std::log(*data);
        ++data;
    }
}
where numpy_iterator is a type that knows how to iterate over numpy arrays following strides.
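The actual numpy_iterator lives in the attached code; purely as an illustration of the concept, a stripped-down strided iterator over a one-dimensional array could look something like this (the real one walks shapes with any number of dimensions):

#include <cstddef>

// Minimal sketch of a strided iterator (1-D only).
template <typename T>
class strided_iterator {
    char* p_;                 // numpy stores strides in bytes, so walk a char*
    std::ptrdiff_t stride_;   // byte distance between consecutive elements
public:
    strided_iterator(void* base, std::ptrdiff_t stride)
        : p_(static_cast<char*>(base)), stride_(stride) {}
    T& operator*() const { return *reinterpret_cast<T*>(p_); }
    strided_iterator& operator++() { p_ += stride_; return *this; }
    bool operator!=(const strided_iterator& other) const { return p_ != other.p_; }
};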
If axis is not None, then the result will not be a single value; it will be an array. The machinery will automatically create an array of the right size and pass it to you, so that the following gets instantiated:
void compute_entropy(numpy_iterator<Base> data, numpy_iterator<Base> past, numpy_iterator<ResultType> results) {
    while (data != past) {
        if (*data) *results += *data * std::log(*data);
        ++data;
        ++results;
    }
}
The results parameter has the strides set up correctly to iterate over results, including looping back when necessary so that the code works as it should.
Notice that the ++results operation seems to drop in and out between the variants. In fact, I was oversimplifying above: there is no instantiation with Result*, but with no_iteration<Result>, which behaves like Result* but has an empty operator++(). You never change your template implementation.
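As a rough reconstruction (not the attached code), no_iteration can be as small as a pointer wrapper whose increment is a no-op, which is what lets the same template body serve both the scalar-result and the array-result cases:

// Behaves like a Result*, except that ++ does nothing, so a template written
// with ++results works unchanged when there is only one output value.
template <typename Result>
struct no_iteration {
    Result* p;
    explicit no_iteration(Result* ptr) : p(ptr) {}
    Result& operator*() const { return *p; }
    no_iteration& operator++() { return *this; }   // intentionally a no-op
};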
(The code above was for explanatory purposes, not an example of working code. The interface I actually used takes more parameters which are not very important for expository purposes. This allows you to, for example, implement the ddof parameter in std()).
ADVANTAGES ===========
For most operations, my code is faster than numpy (it's hard to beat ndarray.sum(), but easy to beat ndarray.std()) on both an Intel 32-bit machine and an AMD 64-bit machine, both less than a year old (you can test it on your machine by running profile.py). For some specific operations, like summing along a non-last axis of a moderately large array, it is slower (I think the copy-and-process tactic might be better in that case than the no-copy, one-pass approach I am using). In most cases I tested, it is faster; in particular, it is faster in the basic case of a well-behaved array.
More important than speed (at least for me) is the fact that this does not allocate more memory than needed. This will not fail with OOM errors.
It's easy to implement specific optimisations. For example, replace the sum function for a specific type with a call to AMD's Framewave library (which uses SIMD instructions):
void compute_sum(short* data, short* past, no_iteration<short> result) {
    fwiSum_16s_C1R(data, sizeof(short), past - data, &*result);
}
or compute the standard deviation of an array of booleans in a single pass (sigma = sqrt(p(1-p))):
void compute_std(bool* data, bool* past, no_iteration<ResType> result) {
    size_type N = (past - data);
    size_type pos = 0;
    while (data != past) {
        if (*data) ++pos;
        ++data;
    }
    *result = std::sqrt(ResType(pos)/ResType(N) * (1 - ResType(pos)/ResType(N)));
}
NOT IMPLEMENTED (YET) ======================
* Non-aligned arrays are not implemented, and out arrays have to be well behaved. My current idea is to compromise and make copies in those cases. I could also, trivially, write code that handles those cases without copying, but I don't think it's worth the cost in code bloat.
* argmax()/argmin(). This is a bit harder to implement than the rest, as it needs a bit more machinery. I think it's possible, but I haven't gotten around to it.
* ptp(). I am hesitating over whether to simply do it in Python:
def ncr_ptp(A, axis=None, dtype=None, out=None):
    res = ncr.max(A, axis=axis, dtype=dtype, out=out)
    min = ncr.min(A, axis=axis, dtype=dtype)
    res -= min
    return res
This does two passes over the data, but no extra copying. I could do the same using the Python array API, of course.
DISADVANTAGES ==============
It's C++. I don't see it as a technical disadvantage, but I understand some people might not like it. If this was used in the numpy code base, it could replace the current macro language (begin repeat // end repeat), so the total number of languages used would not increase.
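To illustrate the comparison (a sketch under my own naming, not code from numpy or from the attached package): a single generic kernel like the one below covers what the current C source spells out once per type inside begin repeat/end repeat blocks.

// One generic kernel; the compiler stamps out the per-type versions that the
// begin repeat/end repeat macro blocks currently write out by hand in C.
template <typename T, typename Acc>
void compute_sum(const T* data, const T* past, Acc* result)
{
    Acc acc = Acc(0);
    while (data != past) {
        acc += Acc(*data);
        ++data;
    }
    *result = acc;
}

// e.g. compute_sum<short, long>(...), compute_sum<double, double>(...), and so on.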
It generates too many functions. Disk is cheap, but do we really need a well-optimised version of std() for the case of complex inputs and boolean output? What's a boolean standard deviation anyway? Maybe some tradeoff is necessary: optimise the defaults and make the others possible.
BUGS =====
I am not correctly implementing the dtype parameter. I thought it controlled the type of the intermediate results, but if I do
import numpy
A = numpy.arange(10)
A.mean(dtype=numpy.int8)
I get
4.5
which surprises me!
Furthermore,
A.mean(dtype=numpy.int8).dtype
returns
dtype('float64')
In my implementation, mean(numpy.arange(10),dtype=numpy.int8) is 4
(This might be considered a numpy bug. By the way, I am not running the Subversion version of numpy for this comparison.)
* I also return only either Python longs or Python floats. This is a trivial change; I just don't know all the right function names to return numpy types.
* A couple of places in the code still have a FIXME on them.
FUTURE =======
I consider this code proof-of-concept. It's cool and it demonstrates that the idea works, but it is rough and needs cleaning up. There might even be bugs in it (especially in the C interface with Python, which is something I am not so familiar with)!
I see three possible paths for this: (1) You agree that this is nice, that it achieves good results, and that I should strive to integrate it into numpy proper, replacing the current implementation. I would start a branch in numpy svn, implement it there, and finally merge into the main branch. I would be perfectly happy to relicense to BSD if this is the case. One could think of handling the array-to-array operations (A+B, A += 2, A != B, ...) with a similar scheme. This would imply that part of the C API would be obsoleted: the whole functions array would stop making sense. I don't know how widely it is used.
(2) I keep maintaining this as a separate package, and whoever wants to use it can do so. I will probably keep it GPL for now. Of course, at some later point in time, it could be integrated into the main branch (again, in that case, I am happy to relicense).
(3) Someone decides to take this as a challenge and re-implements ndarray.std and the like so that they use less memory and are faster, but still in C. I am not much of a C programmer, so I don't see how this could be done without really ugly reliance on the preprocessor, but maybe it can be done.
What do you think?
bye, Luís Pedro Coelho PhD Student in Computational Biology
FYI, for computing the variance (and hence std) you probably should be using Knuth's (or Welford's) one-pass approach (the "on-line algorithm"), so that you do not need a prior pass to compute the mean: http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance
Also, for large arrays you may want to maximise the precision of the accumulator to avoid potential overflow.
Bruce
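For reference, a minimal C++ sketch of Welford's one-pass update (the textbook algorithm, not code from either implementation discussed above):

#include <cmath>
#include <cstddef>

// Welford's on-line algorithm: a single pass updates a running mean and a
// running sum of squared deviations (m2), with good numerical stability and
// no need to know the mean in advance.
double stddev_welford(const double* data, std::size_t n)
{
    double mean = 0.0;
    double m2 = 0.0;
    for (std::size_t i = 0; i != n; ++i) {
        const double delta = data[i] - mean;
        mean += delta / double(i + 1);
        m2 += delta * (data[i] - mean);   // uses the updated mean
    }
    return n ? std::sqrt(m2 / n) : 0.0;   // population std, numpy's default (ddof=0)
}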
participants (3)
- Bruce Southey
- Luis Pedro Coelho
- Travis E. Oliphant