
Dear numpy experts,

I could not find a satisfying solution to the following problem, so I thought I would ask:

In one part of a large program I have to deal a lot with small (2d or 3d) vectors and matrices, performing simple linear algebra operations with them (dot products and matrix multiplications). For several reasons I want the code to be in python.

The most natural choice seemed to be to use numpy. However, the constant per-call cost when dealing with numpy arrays seems to be immense, as demonstrated by this toy program:

****************************************************************
import numpy as np
from time import time

def points_tuple(radius):
    rr = radius**2
    def inside(point):
        return point[0]**2 + point[1]**2 < rr

    M = ((1, 0), (0, 1))
    for x in xrange(-radius, radius + 1):
        for y in xrange(-radius, radius + 1):
            point = (M[0][0] * x + M[0][1] * y,
                     M[1][0] * x + M[1][1] * y)
            if inside(point):
                yield point

def points_numpy(radius):
    rr = radius**2
    def inside(point):
        return np.dot(point, point) < rr

    M = np.identity(2, dtype=int)
    for x in xrange(-radius, radius + 1):
        for y in xrange(-radius, radius + 1):
            point = np.dot(M, (x, y))
            if inside(point):
                yield point

def main():
    r = 200
    for func in [points_tuple, points_numpy]:
        t = time()
        for point in func(r):
            pass
        print func.__name__, time() - t, 'seconds'

if __name__ == '__main__':
    main()
****************************************************************

On my trusty old machine the output (python 2.6, numpy 1.5.1) is:

points_tuple 0.36815404892 seconds
points_numpy 6.20338392258 seconds

I do not need C performance here, but the latter is definitely too slow. In the real program it's not so easy (but possible) to use tuples because the code is dimension-independent. Before considering writing my own "small vector" module, I'd like to hear about other possible solutions. Other people must have stumbled across this before!

Thanks,
Christoph

Here's a version that uses fewer Python loops and is thus faster. What still takes time is the array creation (np.array(...)); I'm not sure exactly why. It may be possible to speed it up.

import itertools
import numpy as np

def points_numpy(radius):
    rr = radius**2
    M = np.identity(2, dtype=int)
    # build all (x, y) grid points at once instead of looping in Python
    x_y = np.array(list(itertools.product(xrange(-radius, radius + 1),
                                          xrange(-radius, radius + 1))))
    x_y_M = np.dot(x_y, M)
    is_inside = (x_y_M ** 2).sum(axis=1) < rr
    for point in x_y_M[is_inside]:
        yield point

-=- Olivier
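If the np.array(list(...)) step is indeed what is slow, a variant worth trying (a sketch, not timed here; points_numpy_grid is just a name made up for this post) builds the coordinate grid directly in numpy:

def points_numpy_grid(radius):
    rr = radius**2
    M = np.identity(2, dtype=int)
    side = np.arange(-radius, radius + 1)
    # all (x, y) pairs of the square grid, shape ((2*radius + 1)**2, 2)
    x_y = np.dstack(np.meshgrid(side, side)).reshape(-1, 2)
    x_y_M = np.dot(x_y, M)
    is_inside = (x_y_M ** 2).sum(axis=1) < rr
    for point in x_y_M[is_inside]:
        yield point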

Olivier Delalleau <shish@keba.be> writes:
Here's a version that uses fewer Python loops and is thus faster. What still takes time is the array creation (np.array(...)); I'm not sure exactly why. It may be possible to speed it up.
Thank you for your suggestion. It doesn't help me however, because the algorithm I'm _really_ trying to speed up cannot be vectorized with numpy in the way you vectorized my toy example. Any other ideas?

11.10.2011 14:14, Christoph Groth wrote: [clip]
Thank you for your suggestion. It doesn't help me however, because the algorithm I'm _really_ trying to speed up cannot be vectorized with numpy in the way you vectorized my toy example.
Any other ideas?
Reformulate the problem so that it can be vectorized. Without knowing more about the actual algorithm you are trying to implement, it's not easy to give more detailed help. -- Pauli Virtanen

Pauli Virtanen <pav@iki.fi> writes:
Thank you for your suggestion. It doesn't help me however, because the algorithm I'm _really_ trying to speed up cannot be vectorized with numpy in the way you vectorized my toy example.
Any other ideas?
Reformulate the problem so that it can be vectorized. Without knowing more about the actual algorithm you are trying to implement, it's not easy to give more detailed help.
My question was about ways to achieve a speedup without modifying the algorithm. I was hoping that there is some numpy-like library for python which for small arrays achieves a performance at least on par with the implementation using tuples. This should be possible technically.

The actual problem I'm trying to solve is finding those points of an n-dimensional lattice which belong to an implicitly given shape. The input is a lattice (specified in the simplest case by n n-dimensional vectors, i.e. an n-by-n matrix), a starting point on that lattice, and a shape function which returns True if a point belongs to the shape and False if it does not. The output is an iterable over the lattice points which belong to the shape.

To generate the output, the algorithm (flood-fill) recursively examines the starting point and its neighbors, calling the shape function for each of them. There are various variants of this algorithm, but all of them rely on the same basic operations.

To my knowledge, it is not possible to vectorize this algorithm using numpy. One can vectorize it if a bounding box for the shape is known in advance, but this is not very efficient, as all the lattice points inside the bounding box are checked.

Christoph
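To make this concrete, a minimal sketch of such a flood-fill (the helper signatures are invented for this post, not taken from the real program):

from collections import deque

def flood_fill(start, neighbors, inside):
    # Yield the lattice points connected to `start` for which
    # `inside(point)` is true.  `neighbors(point)` returns the adjacent
    # lattice points as hashable tuples.  Note that `inside` -- and, in
    # the real program, a small np.dot -- is called once per point.
    seen = set([start])
    queue = deque([start])
    while queue:
        point = queue.popleft()
        if not inside(point):
            continue
        yield point
        for nb in neighbors(point):
            if nb not in seen:
                seen.add(nb)
                queue.append(nb)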

On Tue, Oct 11, 2011 at 11:57 AM, Christoph Groth <cwg@falma.de> wrote:
Pauli Virtanen <pav@iki.fi> writes:
Thank you for your suggestion. It doesn't help me however, because the algorithm I'm _really_ trying to speed up cannot be vectorized with numpy in the way you vectorized my toy example.
Any other ideas?
Reformulate the problem so that it can be vectorized. Without knowing more about the actual algorithm you are trying to implement, it's not easy to give more detailed help.
My question was about ways to achieve a speedup without modifying the algorithm. I was hoping that there is some numpy-like library for python which for small arrays achieves a performance at least on par with the implementation using tuples. This should be possible technically.
So it's the dot function being called repeatedly on smallish arrays that's the bottleneck? I've run into this as well. See this thread [1]. You might gain some speed if you drop it down into Cython; there are some examples in that thread. If you're still up against it, you can try the C code that Fernando posted for fast matrix multiplication (I haven't yet), or you might do well to use tokyo from Cython since Wes has fixed it up [2].

I'd be very interested to hear if you achieve a great speed-up with cython+tokyo.

Cheers,
Skipper

[1] http://mail.scipy.org/pipermail/scipy-user/2010-December/thread.html#27791
[2] https://github.com/wesm/tokyo
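To see how much of the cost is pure call overhead, here is a tiny measurement sketch (plain Python 2 + numpy; the numbers will of course vary by machine):

import numpy as np
from time import time

n = 100000
v = np.array([1, 2])
a, b = 1, 2

t = time()
for i in xrange(n):
    np.dot(v, v)        # for size-2 operands this is mostly call overhead
print 'np.dot:', time() - t, 'seconds'

t = time()
for i in xrange(n):
    a * a + b * b       # the same arithmetic on plain Python scalars
print 'scalar:', time() - t, 'seconds'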

Another idea would be to use Theano (http://deeplearning.net/software/theano/). It's a bit overkill, though, and you would need to express most of your algorithm in a symbolic way to be able to take advantage of it. You would then be able to write your own C code to do the array operations that are too slow when relying on numpy. If you are interested in pursuing this direction, let me know and I can give you a few pointers. -=- Olivier

Skipper Seabold <jsseabold@gmail.com> writes:
So it's the dot function being called repeatedly on smallish arrays that's the bottleneck? I've run into this as well. See this thread [1]. (...)
Thanks for the links. "tokyo" is interesting, though I fear the intermediate matrix-size regime where it really makes a difference will be rather small. My concern is with really tiny vectors, for which it's not even worthwhile to call BLAS.
I'd be very interested to hear if you achieve a great speed-up with cython+tokyo.
I'll try to solve this problem one way or another. I'll post here if I end up with something interesting. Christoph

On Tue, Oct 11, 2011 at 12:41 PM, Christoph Groth <cwg@falma.de> wrote:
Skipper Seabold <jsseabold@gmail.com> writes:
So it's the dot function being called repeatedly on smallish arrays that's the bottleneck? I've run into this as well. See this thread [1]. (...)
Thanks for the links. "tokyo" is interesting, though I fear the intermediate matrix-size regime where it really makes a difference will be rather small. My concern is with really tiny vectors, for which it's not even worthwhile to call BLAS.
IIUC, it's not so much the BLAS that's helpful but avoiding the overhead in calling numpy.dot from cython.
I'd be very interested to hear if you achieve a great speed-up with cython+tokyo.
I'll try to solve this problem one way or another. I'll post here if I end up with something interesting.
Please do. Skipper

On 10/11/2011 12:06 PM, Skipper Seabold wrote: [clip]

In the example, M is an identity 2-by-2 array. This creates a lot of overhead: arrays are created from a tuple, followed by two dot operations. But the tuple code is not exactly equivalent, because M is 'expanded' into scalar expressions to avoid some of the unnecessary multiplications. Thus the tuple code is already a different algorithm than the numpy code, so the comparison is not really correct.

All that is needed here for looping over scalar values of x, y and radius is to evaluate (x*x + y*y) < radius**2. That could probably be done with array multiplication and broadcasting.

Bruce
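For the toy example, the broadcasting version might look like this (a sketch; it returns a list instead of being a generator, and points_broadcast is a name made up here):

import numpy as np

def points_broadcast(radius):
    side = np.arange(-radius, radius + 1)
    x = side[:, np.newaxis]             # column, shape (N, 1)
    y = side[np.newaxis, :]             # row, shape (1, N)
    mask = x * x + y * y < radius**2    # broadcasts to shape (N, N)
    xs, ys = np.nonzero(mask)
    return zip(side[xs], side[ys])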

11.10.2011 17:57, Christoph Groth wrote: [clip]
My question was about ways to achieve a speedup without modifying the algorithm. I was hoping that there is some numpy-like library for python which for small arrays achieves a performance at least on par with the implementation using tuples. This should be possible technically.
I'm not aware of such a library. Writing one e.g. with Cython should be quite straightforward, however. [clip]
To generate the output, the algorithm (flood-fill) recursively examines the starting point and its neighbors, calling for each of them the shape function. There are various variants of this algorithm, but all of them rely on the same basic operations.
To my knowledge, it is not possible to vectorize this algorithm using numpy. One can vectorize it if a bounding box for the shape is known in advance, but this is not very efficient as all the lattice points inside the bounding box are checked.
The only way to vectorize this that I see is to write the flood-fill algorithm on rectangular supercells, so that the constant costs are amortized. Sounds a bit messy to do, though. -- Pauli Virtanen
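Roughly, the flood-fill would then walk over cells instead of single points, and each cell would be tested in one vectorized call; a sketch of the per-cell step (the shape_mask signature is hypothetical):

import numpy as np

def cell_points(corner, B, shape_mask):
    # Lattice points of the B-by-B supercell with lower-left corner
    # `corner` that lie inside the shape.  `shape_mask` takes an (m, 2)
    # integer array of points and returns a boolean array of length m,
    # so numpy's constant cost is paid once per cell, not once per point.
    xs = np.arange(corner[0], corner[0] + B)
    ys = np.arange(corner[1], corner[1] + B)
    pts = np.dstack(np.meshgrid(xs, ys)).reshape(-1, 2)
    return pts[shape_mask(pts)]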

My question was about ways to achieve a speedup without modifying the algorithm. I was hoping that there is some numpy-like library for python which for small arrays achieves a performance at least on par with the implementation using tuples. This should be possible technically.
I'm not aware of such a library. Writing one e.g. with Cython should be quite straightforward, however.
That's what I'll probably end up doing...

Any other ideas?
I'm not an expert at all, but as far as I understand, if you cannot vectorize your problem, numpy is not the best tool to use if speed matters at all. Of course it's not a realistic example, but a simple loop computing a cosine is 3-4 times slower using numpy's cos than Python's math.cos (which is already slower than IDL).
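This is easy to check on one's own machine, e.g. with a rough sketch like the following (the exact ratio will of course vary):

import timeit

t_math = timeit.timeit('cos(0.5)', setup='from math import cos',
                       number=1000000)
t_numpy = timeit.timeit('cos(0.5)', setup='from numpy import cos',
                        number=1000000)
print 'math.cos: %.2fs   numpy.cos: %.2fs' % (t_math, t_numpy)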
participants (6)
- Bruce Southey
- Christoph Groth
- Jean-Luc Menut
- Olivier Delalleau
- Pauli Virtanen
- Skipper Seabold