Dear numpy experts,

I could not find a satisfying solution to the following problem, so I thought I would ask:

In one part of a large program I have to deal a lot with small (2d or 3d) vectors and matrices, performing simple linear algebra operations with them (dot products and matrix multiplications). For several reasons I want the code to be in Python.

The most natural choice seemed to be to use numpy. However, the constant time cost when dealing with numpy arrays seems to be immense, as demonstrated by this toy program:

****************************************************************
import numpy as np
from time import time

def points_tuple(radius):
    rr = radius**2
    def inside(point):
        return point[0]**2 + point[1]**2 < rr

    M = ((1, 0), (0, 1))
    for x in xrange(-radius, radius + 1):
        for y in xrange(-radius, radius + 1):
            point = (M[0][0] * x + M[0][1] * y,
                     M[1][0] * x + M[1][1] * y)
            if inside(point):
                yield point

def points_numpy(radius):
    rr = radius**2
    def inside(point):
        return np.dot(point, point) < rr

    M = np.identity(2, dtype=int)
    for x in xrange(-radius, radius + 1):
        for y in xrange(-radius, radius + 1):
            point = np.dot(M, (x, y))
            if inside(point):
                yield point

def main():
    r = 200
    for func in [points_tuple, points_numpy]:
        t = time()
        for point in func(r):
            pass
        print func.__name__, time() - t, 'seconds'

if __name__ == '__main__':
    main()
****************************************************************

On my trusty old machine the output (Python 2.6, numpy 1.5.1) is:

points_tuple 0.36815404892 seconds
points_numpy 6.20338392258 seconds

I do not need C performance here, but the latter is definitely too slow.

In the real program it's not so easy (but possible) to use tuples because the code is dimension-independent. Before considering writing my own "small vector" module, I'd like to ask about other possible solutions. Other people must have stumbled across this before!

Thanks,
Christoph
Here's a version that uses fewer Python loops and is thus faster. What still takes time is the array creation (np.array(...)); I'm not sure exactly why. It may be possible to speed it up.
import itertools

def points_numpy(radius):
    rr = radius**2
    M = np.identity(2, dtype=int)
    x_y = np.array(list(itertools.product(xrange(-radius, radius + 1),
                                          xrange(-radius, radius + 1))))
    x_y_M = np.dot(x_y, M)
    is_inside = (x_y_M ** 2).sum(axis=1) < rr
    for point in x_y_M[is_inside]:
        yield point
-- Olivier
2011/10/11 Christoph Groth
Olivier Delalleau
Here's a version that uses fewer Python loops and is thus faster. What still takes time is the array creation (np.array(...)); I'm not sure exactly why. It may be possible to speed it up.
Thank you for your suggestion. It doesn't help me however, because the algorithm I'm _really_ trying to speed up cannot be vectorized with numpy in the way you vectorized my toy example. Any other ideas?
On 11.10.2011 14:14, Christoph Groth wrote: [clip]
Thank you for your suggestion. It doesn't help me however, because the algorithm I'm _really_ trying to speed up cannot be vectorized with numpy in the way you vectorized my toy example.
Any other ideas?
Reformulate the problem so that it can be vectorized. Without knowing more about the actual algorithm you are trying to implement, it's not easy to give more detailed help.

-- Pauli Virtanen
Pauli Virtanen
Thank you for your suggestion. It doesn't help me however, because the algorithm I'm _really_ trying to speed up cannot be vectorized with numpy in the way you vectorized my toy example.
Any other ideas?
Reformulate the problem so that it can be vectorized. Without knowing more about the actual algorithm you are trying to implement, it's not easy to give more detailed help.
My question was about ways to achieve a speedup without modifying the algorithm. I was hoping that there is some numpy-like library for Python which for small arrays achieves a performance at least on par with the implementation using tuples. This should be possible technically.

The actual problem I'm trying to solve is finding those points of an n-dimensional lattice which belong to an implicitly given shape. The input is a lattice (specified in the simplest case by n n-dimensional vectors, i.e. an n-by-n matrix), a starting point on that lattice, and a shape function which returns True if a point belongs to the shape and False if it does not. The output is an iterable over the lattice points which belong to the shape.

To generate the output, the algorithm (flood fill) recursively examines the starting point and its neighbors, calling the shape function for each of them. There are various variants of this algorithm, but all of them rely on the same basic operations.

To my knowledge, it is not possible to vectorize this algorithm using numpy. One can vectorize it if a bounding box for the shape is known in advance, but this is not very efficient, as all the lattice points inside the bounding box are checked.

Christoph
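[Editor's note: the flood fill just described can be sketched roughly as follows. This is not code from the thread; the function name `shape_points`, the breadth-first (rather than recursive) formulation, and the 2n-nearest-neighbor scheme are illustrative assumptions, and it is written in Python 3 unlike the thread's Python 2 code.]

```python
from collections import deque

import numpy as np

def shape_points(lattice, start, inside):
    """Breadth-first flood fill over an n-dimensional lattice.

    lattice: n-by-n matrix whose columns span the lattice.
    start:   integer lattice coordinates of a point inside the shape.
    inside:  shape function mapping a real-space point to True/False.
    """
    lattice = np.asarray(lattice)
    n = lattice.shape[0]
    seen = {tuple(start)}
    queue = deque([tuple(start)])
    while queue:
        coords = queue.popleft()
        point = lattice.dot(coords)          # real-space position
        if not inside(point):
            continue
        yield tuple(point)
        # enqueue the 2n nearest neighbors not yet visited
        for i in range(n):
            for step in (-1, 1):
                neigh = coords[:i] + (coords[i] + step,) + coords[i + 1:]
                if neigh not in seen:
                    seen.add(neigh)
                    queue.append(neigh)

# e.g. the square-lattice points strictly inside a disk of radius 2.5
pts = sorted(shape_points(np.identity(2, dtype=int), (0, 0),
                          lambda p: p[0]**2 + p[1]**2 < 2.5**2))
```

Only points connected to the starting point are visited, so no bounding box is needed; but note the per-point `lattice.dot` call is exactly the small-array operation whose overhead is under discussion.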
On Tue, Oct 11, 2011 at 11:57 AM, Christoph Groth
Pauli Virtanen
writes: Thank you for your suggestion. It doesn't help me however, because the algorithm I'm _really_ trying to speed up cannot be vectorized with numpy in the way you vectorized my toy example.
Any other ideas?
Reformulate the problem so that it can be vectorized. Without knowing more about the actual algorithm you are trying to implement, it's not easy to give more detailed help.
My question was about ways to achieve a speedup without modifying the algorithm. I was hoping that there is some numpylike library for python which for small arrays achieves a performance at least on par with the implementation using tuples. This should be possible technically.
So it's the dot function being called repeatedly on smallish arrays that's the bottleneck? I've run into this as well. See this thread [1]. You might gain some speed if you drop it down into Cython; there are some examples in that thread. If you're still up against it, you can try the C code that Fernando posted for fast matrix multiplication (I haven't yet), or you might do well to use tokyo from Cython since Wes has fixed it up [2].

I'd be very interested to hear if you achieve a great speedup with cython+tokyo.

Cheers,
Skipper

[1] http://mail.scipy.org/pipermail/scipy-user/2010-December/thread.html#27791
[2] https://github.com/wesm/tokyo
2011/10/11 Skipper Seabold
Another idea would be to use Theano (http://deeplearning.net/software/theano/). It's a bit overkill, though, and you would need to express most of your algorithm in a symbolic way to be able to take advantage of it. You would then be able to write your own C code to do the array operations that are too slow when relying on numpy. If you are interested in pursuing this direction, let me know and I can give you a few pointers.

-- Olivier
Skipper Seabold
So it's the dot function being called repeatedly on smallish arrays that's the bottleneck? I've run into this as well. See this thread [1]. (...)
Thanks for the links. "tokyo" is interesting, though I fear the intermediate matrix-size regime where it really makes a difference will be rather small. My concern is with really tiny vectors, where it's not even worthwhile to call BLAS.
I'd be very interested to hear if you achieve a great speedup with cython+tokyo.
I'll try to solve this problem one way or another. I'll post here if I end up with something interesting.

Christoph
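[Editor's note: for the really tiny fixed-size case, one workaround in the spirit of the "small vector module" Christoph mentions is to unroll the products in plain Python and keep vectors as tuples, avoiding both array creation and the numpy.dot dispatch overhead. This is an illustrative sketch, not code from the thread; the names `dot2` and `matvec2` are made up.]

```python
import numpy as np

def dot2(a, b):
    # unrolled dot product for length-2 sequences: no array creation,
    # no BLAS call, just two multiplications and an addition
    return a[0] * b[0] + a[1] * b[1]

def matvec2(M, v):
    # unrolled 2x2 matrix-vector product returning a plain tuple
    return (M[0][0] * v[0] + M[0][1] * v[1],
            M[1][0] * v[0] + M[1][1] * v[1])

# agrees with numpy on the same inputs
M = ((2, 1), (0, 3))
v = (4, -5)
assert dot2(v, v) == int(np.dot(v, v))
assert matvec2(M, v) == tuple(np.dot(M, v))
```

The obvious cost is that such code is fixed to one dimension; a dimension-independent version would have to generate these functions per n, which is essentially what a small Cython module would do.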
On Tue, Oct 11, 2011 at 12:41 PM, Christoph Groth
Skipper Seabold
writes: So it's the dot function being called repeatedly on smallish arrays that's the bottleneck? I've run into this as well. See this thread [1]. (...)
Thanks for the links. "tokyo" is interesting, though I fear the intermediate matrix size regime where it really makes a difference will be rather small. My concern is in really tiny vectors, where it's not even worth to call BLAS.
IIUC, it's not so much the BLAS that's helpful but avoiding the overhead in calling numpy.dot from cython.
I'd be very interested to hear if you achieve a great speedup with cython+tokyo.
I try to solve this problem in some way or other. I'll post here if I end up with something interesting.
Please do. Skipper
On 10/11/2011 12:06 PM, Skipper Seabold wrote:
[clip]

In the example, M is an identity 2-by-2 array. This creates a lot of overhead in creating arrays from a tuple, followed by two dot operations. But the tuple code is not exactly equivalent, because M is 'expanded' into a single dimension to avoid some of the unnecessary multiplications. Thus the tuple code is already a different algorithm than the numpy code, so the comparison is not really correct.

All that is needed here for looping over scalar values of x, y and radius is to evaluate (x*x + y*y) < radius**2. That could probably be done with array multiplication and broadcasting.

Bruce
On 11.10.2011 17:57, Christoph Groth wrote: [clip]
My question was about ways to achieve a speedup without modifying the algorithm. I was hoping that there is some numpylike library for python which for small arrays achieves a performance at least on par with the implementation using tuples. This should be possible technically.
I'm not aware of such a library. Writing one e.g. with Cython should be quite straightforward, however. [clip]
To generate the output, the algorithm (floodfill) recursively examines the starting point and its neighbors, calling for each of them the shape function. There are various variants of this algorithm, but all of them rely on the same basic operations.
To my knowledge, it is not possible to vectorize this algorithm using numpy. One can vectorize it if a bounding box for the shape is known in advance, but this is not very efficient as all the lattice points inside the bounding box are checked.
The only way to vectorize this that I see is to write the flood-fill algorithm on rectangular supercells, so that the constant costs are amortized. Sounds a bit messy to do, though.

-- Pauli Virtanen
My question was about ways to achieve a speedup without modifying the algorithm. I was hoping that there is some numpylike library for python which for small arrays achieves a performance at least on par with the implementation using tuples. This should be possible technically.
I'm not aware of such a library. Writing one e.g. with Cython should be quite straightforward, however.
That's what I'll probably end up doing...
Any other ideas?
I'm not an expert at all, but as far as I understand, if you cannot vectorize your problem, numpy is not the best tool to use if speed matters at all. Of course it's not a realistic example, but a simple loop computing a cosine is 3-4 times slower using numpy's cos than Python's math.cos (which is already slower than IDL).
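[Editor's note: the scalar-cosine overhead Jean-Luc mentions is easy to measure with the standard-library `timeit` module. A sketch; the absolute times and their ratio are machine-dependent, so no particular slowdown factor is asserted here.]

```python
import math
import timeit

import numpy as np

# per-call cost of one scalar cosine through each implementation
t_math = timeit.timeit('cos(0.5)', setup='from math import cos',
                       number=100000)
t_numpy = timeit.timeit('cos(0.5)', setup='from numpy import cos',
                        number=100000)

# both compute the same value; only the call overhead differs
assert abs(math.cos(0.5) - float(np.cos(0.5))) < 1e-12
```

The gap comes from per-call dispatch (argument conversion to a zero-dimensional array and ufunc machinery), not from the cosine itself, which is why it matters only for scalars and tiny arrays.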
participants (6)
- Bruce Southey
- Christoph Groth
- Jean-Luc Menut
- Olivier Delalleau
- Pauli Virtanen
- Skipper Seabold