Hello all,
Numpy arrays come with several reduce operations: sum(), std(), argmin(),
min(), ....
The traditional implementation of these suffers from two big problems: It is
slow and it often allocates intermediate memory. I have code that is failing
with OOM (out of memory) exceptions in calls to ndarray.std(). I regularly
handle arrays with 100 million entries (have a couple of million objects * 20
features per object = 100 million doubles), so this is a real problem for me.
This being open-source, I decided to solve the problem. My first idea was to
try to improve the numpy code. I failed to see how to do that while
supporting everything that numpy does (multiple types, for example), so I
started an implementation of reduce operations that uses C++ templates to
make code optimised into the types it actually uses, choosing the right
version to use at run time. In the spirit or release-early/release-often, I
attach the first version of this code that runs.
BASIC IDEA
===============
ndarray.std does basically the following (The examples are in pseudo-code even
though the implementation happens to be in C):
def stddev(A):
mu = A.mean()
diff=(A-mu)
maybe_conj=(diff if not complex(A) else diff.conjugate())
diff *= maybe_conj
return diff.sum()
With a lot of temporary arrays. My code does:
def stddev(A):
mu = A.mean() # No getting around this temporary
std = 0
for i in xrange(A.size):
diff = (A[i]-mu)
if complex(A):
diff *= conjugate(diff)
else:
diff *= diff
std += diff
return sqrt(diff/A.size)
Of course, my code does it while taking into account the geometry of the
array. It handles arrays with arbitrary strides.
I do it while avoiding copying the array at any time (while most of the
existing code will transpose/copy the array so that it is well behaved in
memory).
IMPLEMENTATION
===============
I have written some template infrastructure so that, if I wanted a very fast
entropy calculation, on a normalised array, you could do:
template < ... >
void compute_entropy(BaseIterator data, BaseIterator past, ResultsIterator
results) {
while (data != past) {
if (*data) *result += *data * std::log(*data);
++data;
++result;
}
}
and the machinery will instantiate this in several variations, deciding at
run-time which one to call. You just have to write a C interface function
like
PyObject* fast_entropy(PyArrayObject *self, PyObject *args, PyObject *kwds)
{
int axis=NPY_MAXDIMS;
PyArray_Descr *dtype=NULL;
PyArrayObject *out=NULL;
static char *kwlist[] = {"array","axis", "dtype", "out", NULL};
if (!PyArg_ParseTupleAndKeywords(args, kwds, "O|O&O&O&", kwlist,
&self,
PyArray_AxisConverter,&axis,
PyArray_DescrConverter2, &dtype,
PyArray_OutputConverter, &out)) {
Py_XDECREF(dtype);
return NULL;
}
int num = _get_type_num_double(self->descr, dtype);
Py_XDECREF(dtype);
return compress_dispatch<EntropyCompute>(self, out, axis, num,
EmptyType()); // This decides which function to call
}
For contiguous arrays, axis=None, this becomes
void compute_entropy(Base* data, Base* past, Results* result) {
while (data != past) {
if (*data) *result += *data * std::log(*data);
++data;
}
}
which is probably as fast as it can be.
If the array is not contiguous, this becomes
void compute_entropy(numpy_iterator<Base> data, numpy_iterator<Base> past,
Results* result) {
while (data != past) {
if (*data) *result += *data * std::log(*data);
++data;
}
}
where numpy_iterator is a type that knows how to iterate over numpy arrays
following strides.
If axis is not None, then the result will not be a single value, it will be an
array. The machinery will automatically create the array of the right size
and pass it to you with so that the following gets instantiated:
void compute_entropy(numpy_iterator<Base> data, numpy_iterator<Base> past,
numpy_iterator<ResultType> results) {
while (data != past) {
if (*data) *results += *data * std::log(*data);
++data;
++results;
}
}
The results parameter has the strides set up correctly to iterate over
results, including looping back when necessary so that the code works as it
should.
Notice that the ++results operation seems to be dropping in and out. In fact,
I was oversimplifying above. There is no instantiation with Result*, but with
no_iteration<Result> which behaves like Result*, but with an empty operator
++(). You never change your template implementation.
(The code above was for explanatory purposes, not an example of working code.
The interface I actually used takes more parameters which are not very
important for expository purposes. This allows you to, for example, implement
the ddof parameter in std()).
ADVANTAGES
===========
For most operations, my code is faster (it's hard to beat ndarray.sum(), but
easy to beat ndarray.std()) than numpy on both an intel 32 bit machine and an
amd 64 bit machine both newer than one year (you can test it on your machine
by runnning profile.py). For some specific operations, like summing along a
non-last axis on a moderately large array, it is slower (I think that the
copy&process tactic might be better in this case than the no-copy/one pass
operation I am using). In most cases I tested, it is faster. In particular,
the basic case (a well-behaved array), it is faster.
More important than speed (at least for me) is the fact that this does not
allocate more memory than needed. This will not fail with OOM errors.
It's easy to implement specific optimisations. For example, replace a sum
function for a specific case to call AMD's framewave SIMD library (which uses
SIMD instructions):
void compute_sum(short* data, short* past, no_iteration<short> result) {
fwiSum_16s_C1R ( data, sizeof(short), past-start, &*result);
}
or, compute the standard deviation of an array of boolean with a single pass
(sigma = sqrt(p(1-p))):
void compute_std(bool* data, bool* past, no_iteration<ResType> result) {
size_type N = (past-data);
size_type pos = 0;
while (data != past) {
if (*data) ++pos;
++data;
}
*result = std::sqrt(ResType(pos)/ResType(N)*(1-ResType(pos)/ResType(N)));
}
NOT IMPLEMENTED (YET)
======================
Non-aligned arrays are not implemented.
out arrays have to be well behaved.
My current idea is to compromise and make copies in this case. I could also,
trivially, write a thing that handled those cases without copying, but I
don't think it's worth the cost in code bloat.
* argmax()/argmin(). This is a bit harder to implement than the rest, as it
needs a bit more machinery. I think it's possible, but I haven't gotten
around to it.
* ptp(). I hesitate whether to simply do it in Python:
def ncr_ptp(A,axis=None,dtype=None,out=None):
res=ncr.max(A,axis=axis,dtype=dtype,out=out)
min=ncr.min(A,axis=axis,dtype=dtype)
res -= max
return res
Does two passes over the data, but no extra copying. I could do the same using
the Python array API, of course.
DISADVANTAGES
==============
It's C++. I don't see it as a technical disadvantage, but I understand some
people might not like it. If this was used in the numpy code base, it could
replace the current macro language (begin repeat // end repeat), so the total
number of languages used would not increase.
It generates too many functions. Disk is cheap, but do we really need a well
optimised version of std() for the case of complex inputs and boolean output?
What's a boolean standard deviation anyway?. Maybe some tradeoff is
necessary: optimise the defaults and make others possible.
BUGS
=====
I am not correctly implementing the dtype parameter. I thought it controlled
the type of the intermediate results, but if I do
import numpy
A=numpy.arange(10)
A.mean(dtype=numpy.int8)
I get
4.5
which surprises me!
Furthermore,
A.mean(dtype=numpy.int8).dtype
returns
dtype('float64')
In my implementation, mean(numpy.arange(10),dtype=numpy.int8) is 4
(This might be considered a numpy bug --- BTW, i am not running subversion for
this comparison).
* I also return only either python longs or python floats. This is a trivial
change, I just don't know all the right function names to return numpy types.
* A couple of places in the code still have a FIXME on them
FUTURE
=======
I consider this code proof-of-concept. It's cool, it demonstrate that the idea
works, but it is rough and needs cleaning up. There might even be bugs in it
(especially the C interface with Python is something I am not so familiar
with)!
I see three possible paths for this:
(1) You agree that this is nice, it achieves good results and I should strive
to integrate this into numpy proper, replacing the current implementation. I
would start a branch in numpy svn and implement it there and finally merge
into the main branch. I would be perfectly happy to relicense to BSD if this
is the case.
One could think of the array-to-array (A+B, A+=2, A != B,...) operations
using a similar scheme
This would imply that part of the C API would be obsoleted: the whole
functions array would stop making sense. I don't know how widespread it's
used.
(2) I keep maintaining this as a separate package and whoever wants to use
it, can use it. I will probably keep it GPL, for now.
Of course, at some later point in time, one could integrate this into the
main branch (Again, in this case, I am happy to relicense).
(3) Someone decides to take this as a challenge and re-implements ndarray.std
and the like so that it uses less memory and is faster, but does it still in
C. I am not much of a C programmer, so I don't see how this could be done
without really ugly reliance on the preprocessor, but maybe it could be done.
What do you think?
bye,
Luís Pedro Coelho
PhD Student in Computational Biology