No Copy Reduce Operations
Hello all, Numpy arrays come with several reduce operations: sum(), std(), argmin(), min(), .... The traditional implementation of these suffers from two big problems: It is slow and it often allocates intermediate memory. I have code that is failing with OOM (out of memory) exceptions in calls to ndarray.std(). I regularly handle arrays with 100 million entries (have a couple of million objects * 20 features per object = 100 million doubles), so this is a real problem for me. This being open-source, I decided to solve the problem. My first idea was to try to improve the numpy code. I failed to see how to do that while supporting everything that numpy does (multiple types, for example), so I started an implementation of reduce operations that uses C++ templates to make code optimised into the types it actually uses, choosing the right version to use at run time. In the spirit or release-early/release-often, I attach the first version of this code that runs. BASIC IDEA =============== ndarray.std does basically the following (The examples are in pseudo-code even though the implementation happens to be in C): def stddev(A): mu = A.mean() diff=(A-mu) maybe_conj=(diff if not complex(A) else diff.conjugate()) diff *= maybe_conj return diff.sum() With a lot of temporary arrays. My code does: def stddev(A): mu = A.mean() # No getting around this temporary std = 0 for i in xrange(A.size): diff = (A[i]-mu) if complex(A): diff *= conjugate(diff) else: diff *= diff std += diff return sqrt(diff/A.size) Of course, my code does it while taking into account the geometry of the array. It handles arrays with arbitrary strides. I do it while avoiding copying the array at any time (while most of the existing code will transpose/copy the array so that it is well behaved in memory). IMPLEMENTATION =============== I have written some template infrastructure so that, if I wanted a very fast entropy calculation, on a normalised array, you could do: template < ... > void compute_entropy(BaseIterator data, BaseIterator past, ResultsIterator results) { while (data != past) { if (*data) *result += *data * std::log(*data); ++data; ++result; } } and the machinery will instantiate this in several variations, deciding at run-time which one to call. You just have to write a C interface function like PyObject* fast_entropy(PyArrayObject *self, PyObject *args, PyObject *kwds) { int axis=NPY_MAXDIMS; PyArray_Descr *dtype=NULL; PyArrayObject *out=NULL; static char *kwlist[] = {"array","axis", "dtype", "out", NULL}; if (!PyArg_ParseTupleAndKeywords(args, kwds, "O|O&O&O&", kwlist, &self, PyArray_AxisConverter,&axis, PyArray_DescrConverter2, &dtype, PyArray_OutputConverter, &out)) { Py_XDECREF(dtype); return NULL; } int num = _get_type_num_double(self->descr, dtype); Py_XDECREF(dtype); return compress_dispatch<EntropyCompute>(self, out, axis, num, EmptyType()); // This decides which function to call } For contiguous arrays, axis=None, this becomes void compute_entropy(Base* data, Base* past, Results* result) { while (data != past) { if (*data) *result += *data * std::log(*data); ++data; } } which is probably as fast as it can be. If the array is not contiguous, this becomes void compute_entropy(numpy_iterator<Base> data, numpy_iterator<Base> past, Results* result) { while (data != past) { if (*data) *result += *data * std::log(*data); ++data; } } where numpy_iterator is a type that knows how to iterate over numpy arrays following strides. If axis is not None, then the result will not be a single value, it will be an array. The machinery will automatically create the array of the right size and pass it to you with so that the following gets instantiated: void compute_entropy(numpy_iterator<Base> data, numpy_iterator<Base> past, numpy_iterator<ResultType> results) { while (data != past) { if (*data) *results += *data * std::log(*data); ++data; ++results; } } The results parameter has the strides set up correctly to iterate over results, including looping back when necessary so that the code works as it should. Notice that the ++results operation seems to be dropping in and out. In fact, I was oversimplifying above. There is no instantiation with Result*, but with no_iteration<Result> which behaves like Result*, but with an empty operator ++(). You never change your template implementation. (The code above was for explanatory purposes, not an example of working code. The interface I actually used takes more parameters which are not very important for expository purposes. This allows you to, for example, implement the ddof parameter in std()). ADVANTAGES =========== For most operations, my code is faster (it's hard to beat ndarray.sum(), but easy to beat ndarray.std()) than numpy on both an intel 32 bit machine and an amd 64 bit machine both newer than one year (you can test it on your machine by runnning profile.py). For some specific operations, like summing along a non-last axis on a moderately large array, it is slower (I think that the copy&process tactic might be better in this case than the no-copy/one pass operation I am using). In most cases I tested, it is faster. In particular, the basic case (a well-behaved array), it is faster. More important than speed (at least for me) is the fact that this does not allocate more memory than needed. This will not fail with OOM errors. It's easy to implement specific optimisations. For example, replace a sum function for a specific case to call AMD's framewave SIMD library (which uses SIMD instructions): void compute_sum(short* data, short* past, no_iteration<short> result) { fwiSum_16s_C1R ( data, sizeof(short), past-start, &*result); } or, compute the standard deviation of an array of boolean with a single pass (sigma = sqrt(p(1-p))): void compute_std(bool* data, bool* past, no_iteration<ResType> result) { size_type N = (past-data); size_type pos = 0; while (data != past) { if (*data) ++pos; ++data; } *result = std::sqrt(ResType(pos)/ResType(N)*(1-ResType(pos)/ResType(N))); } NOT IMPLEMENTED (YET) ====================== Non-aligned arrays are not implemented. out arrays have to be well behaved. My current idea is to compromise and make copies in this case. I could also, trivially, write a thing that handled those cases without copying, but I don't think it's worth the cost in code bloat. * argmax()/argmin(). This is a bit harder to implement than the rest, as it needs a bit more machinery. I think it's possible, but I haven't gotten around to it. * ptp(). I hesitate whether to simply do it in Python: def ncr_ptp(A,axis=None,dtype=None,out=None): res=ncr.max(A,axis=axis,dtype=dtype,out=out) min=ncr.min(A,axis=axis,dtype=dtype) res -= max return res Does two passes over the data, but no extra copying. I could do the same using the Python array API, of course. DISADVANTAGES ============== It's C++. I don't see it as a technical disadvantage, but I understand some people might not like it. If this was used in the numpy code base, it could replace the current macro language (begin repeat // end repeat), so the total number of languages used would not increase. It generates too many functions. Disk is cheap, but do we really need a well optimised version of std() for the case of complex inputs and boolean output? What's a boolean standard deviation anyway?. Maybe some tradeoff is necessary: optimise the defaults and make others possible. BUGS ===== I am not correctly implementing the dtype parameter. I thought it controlled the type of the intermediate results, but if I do import numpy A=numpy.arange(10) A.mean(dtype=numpy.int8) I get 4.5 which surprises me! Furthermore, A.mean(dtype=numpy.int8).dtype returns dtype('float64') In my implementation, mean(numpy.arange(10),dtype=numpy.int8) is 4 (This might be considered a numpy bug --- BTW, i am not running subversion for this comparison). * I also return only either python longs or python floats. This is a trivial change, I just don't know all the right function names to return numpy types. * A couple of places in the code still have a FIXME on them FUTURE ======= I consider this code proof-of-concept. It's cool, it demonstrate that the idea works, but it is rough and needs cleaning up. There might even be bugs in it (especially the C interface with Python is something I am not so familiar with)! I see three possible paths for this: (1) You agree that this is nice, it achieves good results and I should strive to integrate this into numpy proper, replacing the current implementation. I would start a branch in numpy svn and implement it there and finally merge into the main branch. I would be perfectly happy to relicense to BSD if this is the case. One could think of the array-to-array (A+B, A+=2, A != B,...) operations using a similar scheme This would imply that part of the C API would be obsoleted: the whole functions array would stop making sense. I don't know how widespread it's used. (2) I keep maintaining this as a separate package and whoever wants to use it, can use it. I will probably keep it GPL, for now. Of course, at some later point in time, one could integrate this into the main branch (Again, in this case, I am happy to relicense). (3) Someone decides to take this as a challenge and re-implements ndarray.std and the like so that it uses less memory and is faster, but does it still in C. I am not much of a C programmer, so I don't see how this could be done without really ugly reliance on the preprocessor, but maybe it could be done. What do you think? bye, Luís Pedro Coelho PhD Student in Computational Biology
participants (3)
-
Bruce Southey
-
Luis Pedro Coelho
-
Travis E. Oliphant