[Numpy-discussion] On loop and broadcasting (again)

David Cournapeau david at ar.media.kyoto-u.ac.jp
Thu Oct 5 09:37:33 EDT 2006


Hi,

   The email from Albert made me look again on some surprising results I 
got a few months ago when starting my first "serious" numpy project. I 
noticed that when computing multivariate gaussian densities, centering 
the data was more expensive than everything else, including 
exponentiation. Now that I have some experience with numpy, and 
following the previous discussion, I tried the following script:

import numpy
from numpy.random import randn

import numpy as N

def center_broadcast(X, mu):
   return X - mu

def center_broadcast2(X, mu):
   return N.transpose(X.T - mu.T)

def center_repmat(X, mu):
   n   = X.shape[0]
   return X - N.repmat(mu, n, 1)

def center_outer(X, mu):
   n   = X.shape[0]
   return X - N.outer(N.ones(n), mu)

def center_dot(X, mu):
   n   = X.shape[0]
   return X - N.dot(N.ones((n, 1)), mu[N.newaxis, :])

def center_manual(X, mu):
   return X - mu

def bench():
   n       = 1e5
   d       = 30
   niter   = 10

   X   = randn(n, d)
   mu  = randn(d)
   va  = randn(d) ** 2

   mur = N.repmat(mu, n, 1)

   for i in range(niter):
       y1  = center_broadcast(X, mu)
       y2  = center_repmat(X, mu)
       y3  = center_dot(X, mu)
       y4  = center_outer(X, mu)
       y5  = center_manual(X, mur)
       y6  = center_broadcast2(X, mur)

if __name__ == '__main__':
   import hotshot, hotshot.stats
   profile_file    = 'storage.prof'
   prof    = hotshot.Profile(profile_file, lineevents=1)
   prof.runcall(bench)
   p = hotshot.stats.load(profile_file)
   print p.sort_stats('cumulative').print_stats(20)
   prof.close()


If X is an array containing n samples of d dimension,  and mu an array 
with d elements, I want to compute x - mu, for each row x of X. The idea 
is to compare timing of "manual broadcasting" using repmat vs 
broadcasting, and storage influence:
- center_broadcast expect mu to be a d elements array, and uses the 
numpy broadcast rules
- center_broadcast2 expect same args, but tranpose to force a C-like 
storage.
- center_repmat uses repmat instead of the automatic broadcast to make 
mu a (n, d) array with n times the same row
- center_outer and center_dot implements manually the repmat using 
matrix product (under matlab, this was much faster on large matrices)
- center_manual expects mu to be already in a (n, d) form with n times 
the same row: this is to see the cost of the substraction itself.

The results are the following:

      10    4.204    0.420    4.204    0.420 
storage_email.py:8(center_broadcast)
      10    0.475    0.048    3.449    0.345 
storage_email.py:18(center_outer)
      10    2.964    0.296    2.964    0.296 
/usr/lib/python2.4/site-packages/numpy/core/numeric.py:227(outer)
      10    2.767    0.277    2.768    0.277 
storage_email.py:11(center_broadcast2)
      10    0.485    0.049    1.288    0.129 
storage_email.py:14(center_repmat)
      10    1.217    0.122    1.227    0.123 
storage_email.py:22(center_dot)
      11    0.883    0.080    0.883    0.080 
/usr/lib/python2.4/site-packages/numpy/lib/shape_base.py:522(repmat)
      10    0.457    0.046    0.457    0.046 
storage_email.py:26(center_manual)
       5    0.137    0.027    0.137    0.027 
/usr/lib/python2.4/site-packages/numpy/core/fromnumeric.py:375(sum)
      20    0.020    0.001    0.020    0.001 
/usr/lib/python2.4/site-packages/numpy/core/numeric.py:527(ones)
      31    0.000    0.000    0.000    0.000 
/usr/lib/python2.4/site-packages/numpy/core/numeric.py:125(asarray)
      10    0.000    0.000    0.000    0.000 
/usr/lib/python2.4/site-packages/numpy/core/fromnumeric.py:116(transpose)

As you can see, there is a quite a difference between implementations: 
using repmat is the fastest, with repmat taking twice as much time as 
the substraction itself. Replacing repmat with dot does not give any 
advantage (I guess the underlying implementation uses the same core C 
functions ?). Is it expected than automatic broadcast is so much 
expensive ? And why using tranpose gives a 40% speed improvement ? What 
makes automatic broadcast so expensive compared to repmat ?

For what it worths, the substraction itself is as fast as in matlab, 
whereas the repmat is twice slower (this quick script is not meant as a 
comparison against matlab of course, but I just wanted to check how 
matlab behaves in those situations),

David




More information about the NumPy-Discussion mailing list