[Numpy-discussion] Vectorizing code, for loops, and all that

Albert Strasheim fullung at gmail.com
Mon Oct 2 18:00:10 EDT 2006


Hello all

I recently started looking at David Cournapeau's PyEM package,
specifically his implementation of the K-Means algorithm. He
implemented part of this algorithm in pure Python and also provided
a Pyrex alternative that is significantly faster (about 10 times
with the data I tested with). I tried to figure out why this is so.

The job of this part of the algorithm is pretty simple: given a bunch
of cluster means (the codebook), find the nearest cluster mean for each
data point. A first attempt at implementing this might use two for
loops, one over the data points and one over the cluster means,
computing a Euclidean distance at each step. Slightly better is to use
one loop and some broadcasting; both variants are sketched below.
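
For reference, here is roughly what those two variants look like (a
minimal sketch of my own, not PyEM's actual code; the function names
are made up):

import numpy as N

def kmean_loops(data, code):
    # naive version: two Python-level loops, one over the data
    # points and one over the cluster means
    labels = N.zeros(data.shape[0], dtype=int)
    for i in xrange(data.shape[0]):
        best, best_dist = 0, N.inf
        for j in xrange(code.shape[0]):
            diff = data[i] - code[j]
            dist = N.dot(diff, diff)  # squared Euclidean distance
            if dist < best_dist:
                best, best_dist = j, dist
        labels[i] = best
    return labels

def kmean_onevector(data, code):
    # one Python loop over the data points; the distances to all
    # cluster means are computed at once by broadcasting data[i]
    # against the whole codebook
    labels = N.zeros(data.shape[0], dtype=int)
    for i in xrange(data.shape[0]):
        labels[i] = N.sum((data[i] - code)**2, 1).argmin()
    return labels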

By experimenting with a few variants, I arrived at the following
one-liner:

N.sum((data[:, N.newaxis, :] - code)**2, 2).argmin(axis=1)

where

data = N.random.randn(2000, 39)
nclusters = 64
code = data[0:nclusters,:]
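
To see what the one-liner does, it helps to track the array shapes at
each step (my own annotation of the expression above):

data[:, N.newaxis, :]             # (2000, 1, 39)
data[:, N.newaxis, :] - code      # (2000, 64, 39), broadcast against (64, 39)
(...)**2                          # (2000, 64, 39), squared differences
N.sum(..., 2)                     # (2000, 64), squared distance to each mean
....argmin(axis=1)                # (2000,), index of the nearest mean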

This code was still slow, so I rewrote it as a series of named steps
that each show up separately when profiled with cProfile under Python
2.5 (use profile if you have an older version of Python):

import numpy as N

def kmean():
    data = N.random.randn(2000, 39)
    nclusters = 64
    code = data[0:nclusters, :]
    for i in xrange(10):
        # each step is wrapped in its own function so that it gets
        # a separate line in the profiler output
        def a(): return data[:, N.newaxis, :]
        z = a()
        def b(): return z - code
        z = b()
        def c(): return z**2
        z = c()
        def d(): return N.sum(z, 2)
        z = d()
        def e(): return z.argmin(axis=1)
        z = e()

if __name__ == '__main__':
    import cProfile
    cProfile.run("kmean()")

Profiler output:

In [570]: %run -t -e kmean2.py
         84 function calls in 8.567 CPU seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    8.567    8.567 <string>:1(<module>)
       10    0.000    0.000    0.835    0.084 fromnumeric.py:375(sum)
       10    0.000    0.000    0.000    0.000 kmean2.py:10(a)
       10    6.837    0.684    6.837    0.684 kmean2.py:12(b)
       10    0.623    0.062    0.623    0.062 kmean2.py:14(c)
       10    0.000    0.000    0.835    0.084 kmean2.py:16(d)
       10    0.080    0.008    0.080    0.008 kmean2.py:18(e)
        1    0.180    0.180    8.567    8.567 kmean2.py:5(kmean2)
       10    0.000    0.000    0.000    0.000 {isinstance}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
        1    0.012    0.012    0.012    0.012 {method 'randn' of 'mtrand.RandomState' objects}
       10    0.835    0.083    0.835    0.083 {method 'sum' of 'numpy.ndarray' objects}

It seems the b step (the broadcast subtraction) is taking most of the
time. I tried to simulate this separately:

In [571]: x1 = N.random.randn(2000,39)

In [572]: y1 = N.random.randn(64,39)

In [574]: %timeit z1 = x1[:, N.newaxis, :] - y1
10 loops, best of 3: 703 ms per loop

In [575]: z1.shape
Out[575]: (2000, 64, 39)

As far as I can figure, this operation is doing 2000*64*39 (about five
million) element-wise subtractions. Doing the same number of
subtractions straight up, on arrays that already have the full shape,
yields the following:

In [576]: x2 = N.random.randn(2000,64,39)

In [577]: y2 = N.random.randn(2000,64,39)

In [578]: %timeit z2 = x2-y2
10 loops, best of 3: 108 ms per loop
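
If the extra time comes from the broadcasting itself (strided access
to the (2000, 1, 39) view) rather than from the subtractions, one way
to check, as a sketch I haven't benchmarked carefully, would be to
materialize the broadcast operands into full contiguous arrays first
and then time only the subtraction:

# expand both operands to full (2000, 64, 39) contiguous arrays;
# the assignments themselves broadcast, so do this once, outside
# the timing
x1e = N.empty((2000, 64, 39))
x1e[:] = x1[:, N.newaxis, :]
y1e = N.empty((2000, 64, 39))
y1e[:] = y1

# %timeit z1 = x1e - y1e
# if this matches the 108 ms case above, the cost is in the broadcast
# traversal, not in the subtractions themselves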

Does anybody have any ideas on why the straight subtraction is so much
faster than the broadcast version? Hopefully I didn't mess up
somewhere...

Thanks!

Regards,

Albert

P.S. I used NumPy 1.0rc1 with Python 2.5 on Windows for these tests.
I get similar results with 1.0.dev3239 with Python 2.4.3 on Linux.
